Posted to user@spark.apache.org by John Omernik <jo...@omernik.com> on 2015/03/03 16:10:26 UTC

Re: Shared Drivers

Hey all, just wanted to toss this out again.  With such an active mailing
list, things can get pushed down quickly.  I'd be really excited to
understand this better.

On Fri, Feb 27, 2015 at 9:50 AM, John Omernik <jo...@omernik.com> wrote:

> All - I've asked this question before, and, probably due to my own poor
> comprehension or the clumsy way I ask it, I am still unclear on the
> answer. I'll try again, this time using crude visual aids.
>
> I am using IPython Notebooks with JupyterHub (a multi-user notebook
> server). To make the environment really smooth for data exploration, I am
> creating a Spark context every time a notebook is opened.  (See image
> below.)
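>
> Roughly, each notebook kernel does something like this when it starts (a
> simplified sketch; the app name is just illustrative, and driver memory
> comes from spark-defaults.conf rather than from code):
>
>     from pyspark import SparkConf, SparkContext
>
>     # One driver is launched for every notebook kernel that starts,
>     # whether or not the user ever runs a Spark job.
>     conf = SparkConf().setAppName("notebook-adhoc")
>     sc = SparkContext(conf=conf)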
>
> This can cause issues on my "analysis" (JupyterHub) server: if, say, each
> driver uses 1024 MB, then every notebook opens its own driver regardless
> of how much Spark is actually used.  Yes, I should probably set it up to
> only create the context on demand, but that would add delay. Another issue
> is that once the contexts are created, they are not closed until the
> notebook is halted, so users who leave notebook kernels running waste
> additional resources.
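>
> The "on demand" variant I mention would look roughly like this (a sketch;
> the helper name is mine, not anything Spark provides):
>
>     from pyspark import SparkConf, SparkContext
>
>     _sc = None
>
>     def get_spark():
>         """Create the context lazily; the first use pays the startup delay."""
>         global _sc
>         if _sc is None:
>             _sc = SparkContext(conf=SparkConf().setAppName("notebook-adhoc"))
>         return _sc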
>
> [inline image]
>
> What I would like to do is share a context per user.  Basically, each user
> on the system would get only one Spark context, and all ad hoc queries or
> work would be sent through that one driver.  This makes sense to me: users
> will often want ad hoc Spark capabilities, and this would let a context sit
> open, ready for ad hoc work, while not being over the top in resource
> usage, especially if a kernel is left open.
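>
> In code terms, the behavior I want is something like the sketch below. But
> since every kernel is its own process, making this actually work would need
> the notebooks to attach to an existing driver rather than start their own,
> which is exactly the part that seems to be missing:
>
>     from pyspark import SparkConf, SparkContext
>
>     # Hypothetical per-user cache: one SparkContext per user, shared by
>     # all of that user's notebooks. (This only illustrates the desired
>     # behavior; a plain in-process dict can't span kernel processes.)
>     _contexts = {}
>
>     def context_for(user):
>         if user not in _contexts:
>             conf = SparkConf().setAppName("adhoc-" + user)
>             _contexts[user] = SparkContext(conf=conf)
>         return _contexts[user]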
>
>
> [inline image]
>
> On the Mesos list I was made aware of SPARK-5338, which Tim Chen is working
> on. Based on conversations with him, this wouldn't completely achieve what
> I am looking for, in that each notebook would likely still start a Spark
> context; but at least in that case the Spark driver would reside on the
> cluster, and thereby be resource-managed by the cluster. One thing to note
> here: if the design is similar to the YARN cluster design, then my IPython
> setup may not work at all with Tim's approach, in that the shells (if I
> remember correctly) don't work in cluster mode on YARN.
>
>
> [inline image]
>
>
> Barring that though (the pyshell not working in cluster mode), if drivers
> could be shared per user as I initially proposed, run on the cluster as Tim
> proposed, and the shells still worked in cluster mode, that would be ideal.
> We'd have everything running on the cluster, and we wouldn't have wasted
> drivers, or drivers left open, consuming resources.
>
>
>
>
>
> So I guess, ideally, what keeps us from:
>
> A. (in YARN cluster mode) Using a driver that lives in the cluster?
> B. Sharing drivers?
>
> My guess is I may be missing something fundamental here about how Spark is
> supposed to work, but I see this as a more efficient use of resources for
> this type of work.  I may also look into creating some Docker containers
> and see how those work, but ideally I'd like to understand this at a base
> level... i.e., why can't contexts on a cluster manager (YARN and Mesos) be
> connected to the way a Spark standalone cluster context can?
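>
> (By "connected to" I mean roughly this: with standalone, any shell or
> notebook can just point a new context at the long-running master, as in the
> sketch below with a made-up host name, whereas once the driver runs in YARN
> or Mesos cluster mode there's no equivalent endpoint for an interactive
> shell to attach to.)
>
>     from pyspark import SparkConf, SparkContext
>
>     # Standalone: point the context at the cluster's master URL.
>     conf = (SparkConf()
>             .setMaster("spark://master-host:7077")
>             .setAppName("adhoc-shell"))
>     sc = SparkContext(conf=conf)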
>
> Thanks!
>
>
> John
>
>