Posted to dev@toree.apache.org by Hitesh Shah <hi...@apache.org> on 2016/02/04 20:41:34 UTC

Questions regarding mapping of notebook to spark-kernel to spark contexts

Hello 

As far as I can find from various searches of the old docs on the spark-kernel wiki and issues on JIRA, at least for Jupyter, it seems like each notebook has its own spark-kernel, which in turn wraps its own Spark context. I am curious about how this isolation works, considering that folks have encountered problems when trying to hold multiple Spark contexts within the same JVM/process. Can someone help shed some light on this? 

I also took a look at the Zeppelin work-in-progress branch. That branch, however, seems to use a single spark-kernel per interpreter (the Zeppelin folks, in their current Spark integration, also use a single Spark context across all notebooks), which I believe would imply sharing the single spark-kernel/Spark context across all Zeppelin notebooks. Is this assumption correct? If yes, can anyone suggest how the Zeppelin integration would likely need to change? Would it be possible to, say, have one spark-kernel client per notebook (or re-use them as needed from a pool) and somehow have each one talk to a different spark-kernel instance? 

thanks
— Hitesh 




Re: Questions regarding mapping of notebook to spark-kernel to spark contexts

Posted by Gino Bustelo <gi...@bustelos.com>.
Hitesh,

Each Toree kernel is its own JVM process. As you state, it is not easy to
have multiple SparkContexts in one JVM. It has been done before by the
Ooyala Job Server, but there was quite a bit of hackery to get that to work.
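
For what it is worth, here is a minimal sketch of the underlying issue
(plain Spark 1.x API, nothing Toree-specific; the object and app names are
made up):

    import org.apache.spark.{SparkConf, SparkContext}

    object TwoContextsInOneJvm {
      def main(args: Array[String]): Unit = {
        val first = new SparkContext(
          new SparkConf().setMaster("local[*]").setAppName("first-context"))

        // By default Spark refuses a second active context in the same JVM:
        // "Only one SparkContext may be running in this JVM (see SPARK-2243)".
        try {
          new SparkContext(
            new SparkConf().setMaster("local[*]").setAppName("second-context"))
        } catch {
          case e: Exception => println(s"Second context rejected: ${e.getMessage}")
        }

        // spark.driver.allowMultipleContexts=true turns the error into a
        // warning, but the contexts still share JVM-wide state, so it is
        // fragile in practice -- hence one kernel process per SparkContext.
        first.stop()
      }
    }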

So in summary:

- Each kernel is a standalone process, and each kernel holds a single
  Spark Context.
- A client application can connect to any of these kernels as it sees fit.
- In the case of Jupyter, each notebook has its own kernel, but that is
  just how they chose to implement it; the kernel itself (the process) can
  handle multiple clients if necessary.

If you want multiple "notebooks" sharing RDDs, then each of those notebooks
will have to communicate with the same Toree kernel. Note that, as Corey
mentioned, there are limitations and possible race conditions with this
approach.
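
Concretely, a rough sketch of what that sharing looks like (the HDFS path
and variable names below are made up); because both notebooks drive the
same interpreter and the same SparkContext, a definition made from one is
visible to the other:

    // Cell submitted from notebook A via the shared Toree kernel:
    val events = sc.textFile("hdfs:///tmp/events.log").cache()  // hypothetical path
    events.count()                                              // materialize the cache

    // Cell submitted later from notebook B against the SAME kernel:
    // `events` already exists in the shared interpreter, so B reuses A's cached RDD.
    val errors = events.filter(_.contains("ERROR")).count()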

As I understand it, Zeppelin uses Thrift RPC between the client and the
interpreter, so they should be able to support remote interpreters (either
a separate process or another host). That is how I would recommend
integrating Toree into Zeppelin.


Re: Questions regarding mapping of notebook to spark-kernel to spark contexts

Posted by Hitesh Shah <hi...@apache.org>.
Thanks Corey. 

I understand that quite a few folks are looking to share the Spark context in order to share RDDs across different notebooks. I was looking at this more from an isolation/security point of view, to ensure that each user has their own Spark context, especially in the case where Spark is being used with YARN.

That said, to clarify on the multiple spark-kernels, is that all being done within a single process? 

thanks
— Hitesh 



Re: Questions regarding mapping of notebook to spark-kernel to spark contexts

Posted by Corey Stubbs <ca...@gmail.com>.
Hello Hitesh,
With regard to one Spark Context per kernel:

In general, the rule of thumb is: One notebook/application -> One Kernel ->
One Spark Context.

There is no isolation of Spark contexts within the kernel. You are, as you
said, only able to create one context per kernel. We do have APIs and
command-line options for creating the context yourself if you do not want
the one we supply. Typically, the application consuming from the kernel
will need to implement access controls to isolate kernels from other users.
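
I will not quote the exact Toree options here, but as a rough sketch (plain
Spark 1.x API; the YARN settings are illustrative assumptions, not our
defaults), supplying your own context from a cell could look something like
this, assuming the kernel was told not to create one for you (or you
stopped the supplied one first):

    // Rough sketch only -- the YARN settings below are illustrative.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("my-notebook")            // hypothetical per-user application name
      .setMaster("yarn-client")             // Spark 1.x style YARN client mode
      .set("spark.executor.instances", "2")
      .set("spark.yarn.queue", "analytics") // hypothetical queue for per-user isolation

    val sc = new SparkContext(conf)         // the single context for this kernel's JVM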

It is entirely possible to connect multiple notebooks/applications to a
kernel and "share" the context. I use the term "share" loosely, because the
interpretation of code is a serial process. This means users could
potentially block one another by running some long-running command.
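
As a made-up illustration, a cell like the following submitted by one user
occupies the kernel's single interpreter until it finishes, so cells from
anyone else connected to the same kernel simply queue behind it:

    // Deliberately slow job from user A; user B's cells on the same kernel
    // wait, because code is interpreted one cell at a time.
    sc.parallelize(1 to 100000, 8)
      .map { i => Thread.sleep(1); i }
      .count()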

Kind Regards,

Corey Stubbs

