Posted to user@spark.apache.org by Punya Biswal <pb...@palantir.com> on 2014/03/16 16:09:24 UTC

Separating classloader management from SparkContexts

Hi all,

I'm trying to use Spark to support users who are interactively refining the
code that processes their data. As a concrete example, I might create an
RDD[String] and then write several versions of a function to map over the
RDD until I'm satisfied with the transformation. Right now, once I do
addJar() to add one version of the jar to the SparkContext, there's no way
to add a new version of the jar unless I rename the classes and functions
involved, or lose my current work by re-creating the SparkContext. Is there
a better way to do this?
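
To make that concrete, here's roughly the loop I keep hitting (the jar and
class names below are just placeholders for whatever the user is iterating
on):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("interactive-session"))
val lines = sc.textFile("/data/events.txt")   // the RDD[String] I want to keep reusing

sc.addJar("transforms-v1.jar")                // ships com.example.Transform to the executors
lines.map(s => com.example.Transform.clean(s)).take(5)   // inspect the first attempt

// ... edit Transform.clean, rebuild the jar ...
sc.addJar("transforms-v2.jar")                // same class names as v1
lines.map(s => com.example.Transform.clean(s)).take(5)   // executors still resolve the v1 classes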

One idea that comes to mind is that we could add APIs to create
"sub-contexts" from within a SparkContext. Jars added to a sub-context would
get added to a child classloader on the executor, so that different
sub-contexts could use classes with the same name while still being able to
access on-heap objects for RDDs. If this makes sense conceptually, I'd like
to work on a PR to add such functionality to Spark.

Punya




Re: Separating classloader management from SparkContexts

Posted by Punya Biswal <pb...@palantir.com>.
Hi Andrew,

Thanks for pointing me to that example. My understanding of the JobServer
(based on watching a demo of its UI) is that it maintains a set of spark
contexts and allows people to add jars to them, but doesn't allow unloading
or reloading jars within a spark context. The code in JobCache appears to be
a performance enhancement to speed up retrieval of jars that are used
frequently -- the classloader change is purely on the driver side, so that
the driver can serialize the job instance. I'm looking for a classloader
change on the executor-side, so that different jars can be uploaded to the
same SparkContext even if they contain some of the same classes.
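
To sketch the kind of executor-side change I have in mind (purely
illustrative -- none of this is existing Spark code): each sub-context would
own a child classloader on the executor, parented on the ordinary executor
loader so Spark's own classes and the parent context's jars stay visible, and
tasks tagged with that sub-context would be deserialized and run through it:

import java.net.{URL, URLClassLoader}

// Hypothetical: one child loader per sub-context, holding only that sub-context's jars.
class SubContextLoader(jars: Array[URL], parent: ClassLoader)
  extends URLClassLoader(jars, parent)

// Rough shape of how an executor could run a task for a given sub-context.
def runTask(subContextId: String, loaders: Map[String, SubContextLoader])
           (deserializeAndRun: ClassLoader => Unit): Unit = {
  val loader   = loaders(subContextId)
  val previous = Thread.currentThread().getContextClassLoader
  Thread.currentThread().setContextClassLoader(loader)
  try {
    deserializeAndRun(loader)  // closures resolve classes through the child loader, so
                               // same-named classes can differ between sub-contexts
  } finally {
    Thread.currentThread().setContextClassLoader(previous)
  }
}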

Punya

From:  Andrew Ash <an...@andrewash.com>
Reply-To:  "user@spark.apache.org" <us...@spark.apache.org>
Date:  Wednesday, March 19, 2014 at 2:03 AM
To:  "user@spark.apache.org" <us...@spark.apache.org>
Subject:  Re: Separating classloader management from SparkContexts

Hi Punya, 

This seems like a problem that the recently-announced job-server would
likely have run into at one point.  I haven't tested it yet, but I'd be
interested to see what happens when two jobs in the job server have
conflicting classes.  Does the server correctly segregate each job's classes
from other concurrently-running jobs?

From my reading of the code I think it may not work the way I'd want it to,
though there are a few classloader tricks going on.

https://github.com/ooyala/spark-jobserver/blob/master/job-server/src/spark.jobserver/JobCache.scala

On line 29 there, the jar is added to the SparkContext, and on line 30 it's
added to the job-server's local classloader.

Note also this PR related to classloaders -
https://github.com/apache/spark/pull/119

Andrew



On Tue, Mar 18, 2014 at 9:24 AM, Punya Biswal <pb...@palantir.com> wrote:
> Hi Spark people,
> 
> Sorry to bug everyone again about this, but do people have any thoughts on
> whether sub-contexts would be a good way to solve this problem? I'm thinking
> of something like
> 
> class SparkContext {
>   // ... stuff ...
>   def inSubContext[T](fn: SparkContext => T): T
> }
> 
> this way, I could do something like
> 
> val sc = /* get myself a spark context somehow */;
> val rdd = sc.textFile("/stuff.txt")
> sc.inSubContext { sc1 =>
>   sc1.addJar("extras-v1.jar")
>   print(rdd.filter(/* fn that depends on jar */).count)
> }
> sc.inSubContext { sc2 =>
>   sc2.addJar("extras-v2.jar")
>   print(rdd.filter(/* fn that depends on jar */).count)
> }
> 
> ... even if classes in extras-v1.jar and extras-v2.jar have name collisions.
> 
> Punya
> 
> From: Punya Biswal <pb...@palantir.com>
> Reply-To: <us...@spark.apache.org>
> Date: Sunday, March 16, 2014 at 11:09 AM
> To: "user@spark.apache.org" <us...@spark.apache.org>
> Subject: Separating classloader management from SparkContexts
> 
> Hi all,
> 
> I'm trying to use Spark to support users who are interactively refining the
> code that processes their data. As a concrete example, I might create an
> RDD[String] and then write several versions of a function to map over the RDD
> until I'm satisfied with the transformation. Right now, once I do addJar() to
> add one version of the jar to the SparkContext, there's no way to add a new
> version of the jar unless I rename the classes and functions involved, or lose
> my current work by re-creating the SparkContext. Is there a better way to do
> this?
> 
> One idea that comes to mind is that we could add APIs to create "sub-contexts"
> from within a SparkContext. Jars added to a sub-context would get added to a
> child classloader on the executor, so that different sub-contexts could use
> classes with the same name while still being able to access on-heap objects
> for RDDs. If this makes sense conceptually, I'd like to work on a PR to add
> such functionality to Spark.
> 
> Punya
> 




Re: Separating classloader management from SparkContexts

Posted by Andrew Ash <an...@andrewash.com>.
Hi Punya,

This seems like a problem that the recently-announced job-server would
likely have run into at one point.  I haven't tested it yet, but I'd be
interested to see what happens when two jobs in the job server have
conflicting classes.  Does the server correctly segregate each job's
classes from other concurrently-running jobs?

From my reading of the code I think it may not work the way I'd want it to,
though there are a few classloader tricks going on.

https://github.com/ooyala/spark-jobserver/blob/master/job-server/src/spark.jobserver/JobCache.scala

On line 29 there, the jar is added to the SparkContext, and on line 30 it's
added to the job-server's local classloader.
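
Roughly, the pattern there boils down to making the jar visible in two
places: to the executors via addJar, and to the driver JVM via a local
classloader so the job class can be instantiated. Paraphrasing with made-up
names (this is not the actual JobCache code):

import java.net.{URL, URLClassLoader}
import org.apache.spark.SparkContext

def loadJob(sc: SparkContext, jarPath: String, jobClassName: String): AnyRef = {
  sc.addJar(jarPath)  // executors will fetch this jar and add it to their classpath
  val loader = new URLClassLoader(
    Array(new URL("file:" + jarPath)),
    Thread.currentThread().getContextClassLoader)  // driver-side visibility only
  loader.loadClass(jobClassName).newInstance().asInstanceOf[AnyRef]
}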

Note also this PR related to classloaders -
https://github.com/apache/spark/pull/119

Andrew



On Tue, Mar 18, 2014 at 9:24 AM, Punya Biswal <pb...@palantir.com> wrote:

> Hi Spark people,
>
> Sorry to bug everyone again about this, but do people have any thoughts on
> whether sub-contexts would be a good way to solve this problem? I'm
> thinking of something like
>
> class SparkContext {
>   // ... stuff ...
>   def inSubContext[T](fn: SparkContext => T): T
> }
>
>
> this way, I could do something like
>
> val sc = /* get myself a spark context somehow */;
> val rdd = sc.textFile("/stuff.txt")
> sc.inSubContext { sc1 =>
>   sc1.addJar("extras-v1.jar")
>   print(rdd.filter(/* fn that depends on jar */).count)
> }
> sc.inSubContext { sc2 =>
>   sc2.addJar("extras-v2.jar")
>   print(rdd.filter(/* fn that depends on jar */).count)
> }
>
>
> ... even if classes in extras-v1.jar and extras-v2.jar have name
> collisions.
>
> Punya
>
> From: Punya Biswal <pb...@palantir.com>
> Reply-To: <us...@spark.apache.org>
> Date: Sunday, March 16, 2014 at 11:09 AM
> To: "user@spark.apache.org" <us...@spark.apache.org>
> Subject: Separating classloader management from SparkContexts
>
> Hi all,
>
> I'm trying to use Spark to support users who are interactively refining
> the code that processes their data. As a concrete example, I might create
> an RDD[String] and then write several versions of a function to map over
> the RDD until I'm satisfied with the transformation. Right now, once I do
> addJar() to add one version of the jar to the SparkContext, there's no way
> to add a new version of the jar unless I rename the classes and functions
> involved, or lose my current work by re-creating the SparkContext. Is there
> a better way to do this?
>
> One idea that comes to mind is that we could add APIs to create
> "sub-contexts" from within a SparkContext. Jars added to a sub-context
> would get added to a child classloader on the executor, so that different
> sub-contexts could use classes with the same name while still being able to
> access on-heap objects for RDDs. If this makes sense conceptually, I'd like
> to work on a PR to add such functionality to Spark.
>
> Punya
>
>

Re: Separating classloader management from SparkContexts

Posted by Punya Biswal <pb...@palantir.com>.
Hi Spark people,

Sorry to bug everyone again about this, but do people have any thoughts on
whether sub-contexts would be a good way to solve this problem? I'm thinking
of something like

class SparkContext {
  // ... stuff ...
  def inSubContext[T](fn: SparkContext => T): T
}

this way, I could do something like

val sc = /* get myself a spark context somehow */;
val rdd = sc.textFile("/stuff.txt")
sc.inSubContext { sc1 =>
  sc1.addJar("extras-v1.jar")
  print(rdd.filter(/* fn that depends on jar */).count)
}
sc.inSubContext { sc2 =>
  sc2.addJar("extras-v2.jar")
  print(rdd.filter(/* fn that depends on jar */).count)
}

... even if classes in extras-v1.jar and extras-v2.jar have name collisions.
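
In case it helps the discussion, here's one way the driver half could hang
together. This is entirely hypothetical -- none of these names exist in Spark
today, and a real version would presumably expose the full SparkContext API
(or reuse SparkContext itself) rather than a separate wrapper type:

import org.apache.spark.SparkContext

// Hypothetical driver-side sketch.
class SubContext(parent: SparkContext, id: String) {
  def addJar(path: String): Unit = {
    // Record `path` against this sub-context's id instead of the parent's global
    // jar list, so each executor can build a child classloader scoped to `id`.
    // (Made up -- SparkContext has no such hook today.)
  }
  def textFile(path: String) = parent.textFile(path)  // tasks created here would carry `id`
}

def inSubContext[T](parent: SparkContext)(fn: SubContext => T): T = {
  val sub = new SubContext(parent, java.util.UUID.randomUUID().toString)
  try fn(sub)
  finally {
    // Tear down the sub-context: drop its jars and the executor-side child classloaders.
  }
}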

Punya

From:  Punya Biswal <pb...@palantir.com>
Reply-To:  <us...@spark.apache.org>
Date:  Sunday, March 16, 2014 at 11:09 AM
To:  "user@spark.apache.org" <us...@spark.apache.org>
Subject:  Separating classloader management from SparkContexts

Hi all,

I'm trying to use Spark to support users who are interactively refining the
code that processes their data. As a concrete example, I might create an
RDD[String] and then write several versions of a function to map over the
RDD until I'm satisfied with the transformation. Right now, once I do
addJar() to add one version of the jar to the SparkContext, there's no way
to add a new version of the jar unless I rename the classes and functions
involved, or lose my current work by re-creating the SparkContext. Is there
a better way to do this?

One idea that comes to mind is that we could add APIs to create
"sub-contexts" from within a SparkContext. Jars added to a sub-context would
get added to a child classloader on the executor, so that different
sub-contexts could use classes with the same name while still being able to
access on-heap objects for RDDs. If this makes sense conceptually, I'd like
to work on a PR to add such functionality to Spark.

Punya