You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spark.apache.org by DB Tsai <db...@stanford.edu> on 2014/05/16 10:46:34 UTC

Calling external classes added by sc.addJar needs to be through reflection

Finally find a way out of the ClassLoader maze! It took me some times to
understand how it works; I think it worths to document it in a separated
thread.

We're trying to add external utility.jar which contains CSVRecordParser,
and we added the jar to executors through sc.addJar APIs.

If the instance of CSVRecordParser is created without reflection, it
raises *ClassNotFound
Exception*.

data.mapPartitions(lines => {
    val csvParser = new CSVRecordParser((delimiter.charAt(0))
    lines.foreach(line => {
      val lineElems = csvParser.parseLine(line)
    })
    ...
    ...
 )


If the instance of CSVRecordParser is created through reflection, it works.

data.mapPartitions(lines => {
    val loader = Thread.currentThread.getContextClassLoader
    val CSVRecordParser =
        loader.loadClass("com.alpine.hadoop.ext.CSVRecordParser")

    val csvParser = CSVRecordParser.getConstructor(Character.TYPE)
        .newInstance(delimiter.charAt(0).asInstanceOf[Character])

    val parseLine = CSVRecordParser
        .getDeclaredMethod("parseLine", classOf[String])

    lines.foreach(line => {
       val lineElems = parseLine.invoke(csvParser,
line).asInstanceOf[Array[String]]
    })
    ...
    ...
 )


This is identical to this question,
http://stackoverflow.com/questions/7452411/thread-currentthread-setcontextclassloader-without-using-reflection

It's not intuitive for users to load external classes through reflection,
but couple available solutions including 1) messing around
systemClassLoader by calling systemClassLoader.addURI through reflection or
2) forking another JVM to add jars into classpath before bootstrap loader
are very tricky.

Any thought on fixing it properly?

@Xiangrui,
netlib-java jniloader is loaded from netlib-java through reflection, so
this problem will not be seen.

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by DB Tsai <db...@stanford.edu>.
The jars are included in my driver, and I can successfully use them in the
driver. I'm working on a patch, and it's almost working. Will submit a PR
soon.


Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Sun, May 18, 2014 at 11:58 AM, Patrick Wendell <pw...@gmail.com>wrote:

> @db - it's possible that you aren't including the jar in the classpath
> of your driver program (I think this is what mridul was suggesting).
> It would be helpful to see the stack trace of the CNFE.
>
> - Patrick
>
> On Sun, May 18, 2014 at 11:54 AM, Patrick Wendell <pw...@gmail.com>
> wrote:
> > @xiangrui - we don't expect these to be present on the system
> > classpath, because they get dynamically added by Spark (e.g. your
> > application can call sc.addJar well after the JVM's have started).
> >
> > @db - I'm pretty surprised to see that behavior. It's definitely not
> > intended that users need reflection to instantiate their classes -
> > something odd is going on in your case. If you could create an
> > isolated example and post it to the JIRA, that would be great.
> >
> > On Sun, May 18, 2014 at 9:58 AM, Xiangrui Meng <me...@gmail.com> wrote:
> >> I created a JIRA: https://issues.apache.org/jira/browse/SPARK-1870
> >>
> >> DB, could you add more info to that JIRA? Thanks!
> >>
> >> -Xiangrui
> >>
> >> On Sun, May 18, 2014 at 9:46 AM, Xiangrui Meng <me...@gmail.com>
> wrote:
> >>> Btw, I tried
> >>>
> >>> rdd.map { i =>
> >>>   System.getProperty("java.class.path")
> >>> }.collect()
> >>>
> >>> but didn't see the jars added via "--jars" on the executor classpath.
> >>>
> >>> -Xiangrui
> >>>
> >>> On Sat, May 17, 2014 at 11:26 PM, Xiangrui Meng <me...@gmail.com>
> wrote:
> >>>> I can re-produce the error with Spark 1.0-RC and YARN (CDH-5). The
> >>>> reflection approach mentioned by DB didn't work either. I checked the
> >>>> distributed cache on a worker node and found the jar there. It is also
> >>>> in the Environment tab of the WebUI. The workaround is making an
> >>>> assembly jar.
> >>>>
> >>>> DB, could you create a JIRA and describe what you have found so far?
> Thanks!
> >>>>
> >>>> Best,
> >>>> Xiangrui
> >>>>
> >>>> On Sat, May 17, 2014 at 1:29 AM, Mridul Muralidharan <
> mridul@gmail.com> wrote:
> >>>>> Can you try moving your mapPartitions to another class/object which
> is
> >>>>> referenced only after sc.addJar ?
> >>>>>
> >>>>> I would suspect CNFEx is coming while loading the class containing
> >>>>> mapPartitions before addJars is executed.
> >>>>>
> >>>>> In general though, dynamic loading of classes means you use
> reflection to
> >>>>> instantiate it since expectation is you don't know which
> implementation
> >>>>> provides the interface ... If you statically know it apriori, you
> bundle it
> >>>>> in your classpath.
> >>>>>
> >>>>> Regards
> >>>>> Mridul
> >>>>> On 17-May-2014 7:28 am, "DB Tsai" <db...@stanford.edu> wrote:
> >>>>>
> >>>>>> Finally find a way out of the ClassLoader maze! It took me some
> times to
> >>>>>> understand how it works; I think it worths to document it in a
> separated
> >>>>>> thread.
> >>>>>>
> >>>>>> We're trying to add external utility.jar which contains
> CSVRecordParser,
> >>>>>> and we added the jar to executors through sc.addJar APIs.
> >>>>>>
> >>>>>> If the instance of CSVRecordParser is created without reflection, it
> >>>>>> raises *ClassNotFound
> >>>>>> Exception*.
> >>>>>>
> >>>>>> data.mapPartitions(lines => {
> >>>>>>     val csvParser = new CSVRecordParser((delimiter.charAt(0))
> >>>>>>     lines.foreach(line => {
> >>>>>>       val lineElems = csvParser.parseLine(line)
> >>>>>>     })
> >>>>>>     ...
> >>>>>>     ...
> >>>>>>  )
> >>>>>>
> >>>>>>
> >>>>>> If the instance of CSVRecordParser is created through reflection,
> it works.
> >>>>>>
> >>>>>> data.mapPartitions(lines => {
> >>>>>>     val loader = Thread.currentThread.getContextClassLoader
> >>>>>>     val CSVRecordParser =
> >>>>>>         loader.loadClass("com.alpine.hadoop.ext.CSVRecordParser")
> >>>>>>
> >>>>>>     val csvParser = CSVRecordParser.getConstructor(Character.TYPE)
> >>>>>>         .newInstance(delimiter.charAt(0).asInstanceOf[Character])
> >>>>>>
> >>>>>>     val parseLine = CSVRecordParser
> >>>>>>         .getDeclaredMethod("parseLine", classOf[String])
> >>>>>>
> >>>>>>     lines.foreach(line => {
> >>>>>>        val lineElems = parseLine.invoke(csvParser,
> >>>>>> line).asInstanceOf[Array[String]]
> >>>>>>     })
> >>>>>>     ...
> >>>>>>     ...
> >>>>>>  )
> >>>>>>
> >>>>>>
> >>>>>> This is identical to this question,
> >>>>>>
> >>>>>>
> http://stackoverflow.com/questions/7452411/thread-currentthread-setcontextclassloader-without-using-reflection
> >>>>>>
> >>>>>> It's not intuitive for users to load external classes through
> reflection,
> >>>>>> but couple available solutions including 1) messing around
> >>>>>> systemClassLoader by calling systemClassLoader.addURI through
> reflection or
> >>>>>> 2) forking another JVM to add jars into classpath before bootstrap
> loader
> >>>>>> are very tricky.
> >>>>>>
> >>>>>> Any thought on fixing it properly?
> >>>>>>
> >>>>>> @Xiangrui,
> >>>>>> netlib-java jniloader is loaded from netlib-java through
> reflection, so
> >>>>>> this problem will not be seen.
> >>>>>>
> >>>>>> Sincerely,
> >>>>>>
> >>>>>> DB Tsai
> >>>>>> -------------------------------------------------------
> >>>>>> My Blog: https://www.dbtsai.com
> >>>>>> LinkedIn: https://www.linkedin.com/in/dbtsai
> >>>>>>
>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by Xiangrui Meng <me...@gmail.com>.
Hi Patrick,

If spark-submit works correctly, user only needs to specify runtime
jars via `--jars` instead of using `sc.addJar`. Is it correct? I
checked SparkSubmit and yarn.Client but didn't find any code to handle
`args.jars` for YARN mode. So I don't know where in the code the jars
in the distributed cache are added to runtime classpath on executors.

Best,
Xiangrui

On Sun, May 18, 2014 at 11:58 AM, Patrick Wendell <pw...@gmail.com> wrote:
> @db - it's possible that you aren't including the jar in the classpath
> of your driver program (I think this is what mridul was suggesting).
> It would be helpful to see the stack trace of the CNFE.
>
> - Patrick
>
> On Sun, May 18, 2014 at 11:54 AM, Patrick Wendell <pw...@gmail.com> wrote:
>> @xiangrui - we don't expect these to be present on the system
>> classpath, because they get dynamically added by Spark (e.g. your
>> application can call sc.addJar well after the JVM's have started).
>>
>> @db - I'm pretty surprised to see that behavior. It's definitely not
>> intended that users need reflection to instantiate their classes -
>> something odd is going on in your case. If you could create an
>> isolated example and post it to the JIRA, that would be great.
>>
>> On Sun, May 18, 2014 at 9:58 AM, Xiangrui Meng <me...@gmail.com> wrote:
>>> I created a JIRA: https://issues.apache.org/jira/browse/SPARK-1870
>>>
>>> DB, could you add more info to that JIRA? Thanks!
>>>
>>> -Xiangrui
>>>
>>> On Sun, May 18, 2014 at 9:46 AM, Xiangrui Meng <me...@gmail.com> wrote:
>>>> Btw, I tried
>>>>
>>>> rdd.map { i =>
>>>>   System.getProperty("java.class.path")
>>>> }.collect()
>>>>
>>>> but didn't see the jars added via "--jars" on the executor classpath.
>>>>
>>>> -Xiangrui
>>>>
>>>> On Sat, May 17, 2014 at 11:26 PM, Xiangrui Meng <me...@gmail.com> wrote:
>>>>> I can re-produce the error with Spark 1.0-RC and YARN (CDH-5). The
>>>>> reflection approach mentioned by DB didn't work either. I checked the
>>>>> distributed cache on a worker node and found the jar there. It is also
>>>>> in the Environment tab of the WebUI. The workaround is making an
>>>>> assembly jar.
>>>>>
>>>>> DB, could you create a JIRA and describe what you have found so far? Thanks!
>>>>>
>>>>> Best,
>>>>> Xiangrui
>>>>>
>>>>> On Sat, May 17, 2014 at 1:29 AM, Mridul Muralidharan <mr...@gmail.com> wrote:
>>>>>> Can you try moving your mapPartitions to another class/object which is
>>>>>> referenced only after sc.addJar ?
>>>>>>
>>>>>> I would suspect CNFEx is coming while loading the class containing
>>>>>> mapPartitions before addJars is executed.
>>>>>>
>>>>>> In general though, dynamic loading of classes means you use reflection to
>>>>>> instantiate it since expectation is you don't know which implementation
>>>>>> provides the interface ... If you statically know it apriori, you bundle it
>>>>>> in your classpath.
>>>>>>
>>>>>> Regards
>>>>>> Mridul
>>>>>> On 17-May-2014 7:28 am, "DB Tsai" <db...@stanford.edu> wrote:
>>>>>>
>>>>>>> Finally find a way out of the ClassLoader maze! It took me some times to
>>>>>>> understand how it works; I think it worths to document it in a separated
>>>>>>> thread.
>>>>>>>
>>>>>>> We're trying to add external utility.jar which contains CSVRecordParser,
>>>>>>> and we added the jar to executors through sc.addJar APIs.
>>>>>>>
>>>>>>> If the instance of CSVRecordParser is created without reflection, it
>>>>>>> raises *ClassNotFound
>>>>>>> Exception*.
>>>>>>>
>>>>>>> data.mapPartitions(lines => {
>>>>>>>     val csvParser = new CSVRecordParser((delimiter.charAt(0))
>>>>>>>     lines.foreach(line => {
>>>>>>>       val lineElems = csvParser.parseLine(line)
>>>>>>>     })
>>>>>>>     ...
>>>>>>>     ...
>>>>>>>  )
>>>>>>>
>>>>>>>
>>>>>>> If the instance of CSVRecordParser is created through reflection, it works.
>>>>>>>
>>>>>>> data.mapPartitions(lines => {
>>>>>>>     val loader = Thread.currentThread.getContextClassLoader
>>>>>>>     val CSVRecordParser =
>>>>>>>         loader.loadClass("com.alpine.hadoop.ext.CSVRecordParser")
>>>>>>>
>>>>>>>     val csvParser = CSVRecordParser.getConstructor(Character.TYPE)
>>>>>>>         .newInstance(delimiter.charAt(0).asInstanceOf[Character])
>>>>>>>
>>>>>>>     val parseLine = CSVRecordParser
>>>>>>>         .getDeclaredMethod("parseLine", classOf[String])
>>>>>>>
>>>>>>>     lines.foreach(line => {
>>>>>>>        val lineElems = parseLine.invoke(csvParser,
>>>>>>> line).asInstanceOf[Array[String]]
>>>>>>>     })
>>>>>>>     ...
>>>>>>>     ...
>>>>>>>  )
>>>>>>>
>>>>>>>
>>>>>>> This is identical to this question,
>>>>>>>
>>>>>>> http://stackoverflow.com/questions/7452411/thread-currentthread-setcontextclassloader-without-using-reflection
>>>>>>>
>>>>>>> It's not intuitive for users to load external classes through reflection,
>>>>>>> but couple available solutions including 1) messing around
>>>>>>> systemClassLoader by calling systemClassLoader.addURI through reflection or
>>>>>>> 2) forking another JVM to add jars into classpath before bootstrap loader
>>>>>>> are very tricky.
>>>>>>>
>>>>>>> Any thought on fixing it properly?
>>>>>>>
>>>>>>> @Xiangrui,
>>>>>>> netlib-java jniloader is loaded from netlib-java through reflection, so
>>>>>>> this problem will not be seen.
>>>>>>>
>>>>>>> Sincerely,
>>>>>>>
>>>>>>> DB Tsai
>>>>>>> -------------------------------------------------------
>>>>>>> My Blog: https://www.dbtsai.com
>>>>>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>>>>>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by Patrick Wendell <pw...@gmail.com>.
Hey I just looked at the fix here:
https://github.com/apache/spark/pull/848

Given that this is quite simple, maybe it's best to just go with this
and just explain that we don't support adding jars dynamically in YARN
in Spark 1.0. That seems like a reasonable thing to do.

On Wed, May 21, 2014 at 3:15 PM, Patrick Wendell <pw...@gmail.com> wrote:
> Of these two solutions I'd definitely prefer 2 in the short term. I'd
> imagine the fix is very straightforward (it would mostly just be
> remove code), and we'd be making this more consistent with the
> standalone mode which makes things way easier to reason about.
>
> In the long term we'll definitely want to exploit the distributed
> cache more, but at this point it's premature optimization at a high
> complexity cost. Writing stuff to HDFS through is so slow anyways I'd
> guess that serving it directly from the driver is still faster in most
> cases (though for very large jar sizes or very large clusters, yes,
> we'll need the distributed cache).
>
> - Patrick
>
> On Wed, May 21, 2014 at 2:41 PM, Xiangrui Meng <me...@gmail.com> wrote:
>> That's a good example. If we really want to cover that case, there are
>> two solutions:
>>
>> 1. Follow DB's patch, adding jars to the system classloader. Then we
>> cannot put a user class in front of an existing class.
>> 2. Do not send the primary jar and secondary jars to executors'
>> distributed cache. Instead, add them to "spark.jars" in SparkSubmit
>> and serve them via http by called sc.addJar in SparkContext.
>>
>> What is your preference?
>>
>> On Wed, May 21, 2014 at 2:27 PM, Sandy Ryza <sa...@cloudera.com> wrote:
>>> Is that an assumption we can make?  I think we'd run into an issue in this
>>> situation:
>>>
>>> *In primary jar:*
>>> def makeDynamicObject(clazz: String) = Class.forName(clazz).newInstance()
>>>
>>> *In app code:*
>>> sc.addJar("dynamicjar.jar")
>>> ...
>>> rdd.map(x => makeDynamicObject("some.class.from.DynamicJar"))
>>>
>>> It might be fair to say that the user should make sure to use the context
>>> classloader when instantiating dynamic classes, but I think it's weird that
>>> this code would work on Spark standalone but not on YARN.
>>>
>>> -Sandy
>>>
>>>
>>> On Wed, May 21, 2014 at 2:10 PM, Xiangrui Meng <me...@gmail.com> wrote:
>>>
>>>> I think adding jars dynamically should work as long as the primary jar
>>>> and the secondary jars do not depend on dynamically added jars, which
>>>> should be the correct logic. -Xiangrui
>>>>
>>>> On Wed, May 21, 2014 at 1:40 PM, DB Tsai <db...@stanford.edu> wrote:
>>>> > This will be another separate story.
>>>> >
>>>> > Since in the yarn deployment, as Sandy said, the app.jar will be always
>>>> in
>>>> > the systemclassloader which means any object instantiated in app.jar will
>>>> > have parent loader of systemclassloader instead of custom one. As a
>>>> result,
>>>> > the custom class loader in yarn will never work without specifically
>>>> using
>>>> > reflection.
>>>> >
>>>> > Solution will be not using system classloader in the classloader
>>>> hierarchy,
>>>> > and add all the resources in system one into custom one. This is the
>>>> > approach of tomcat takes.
>>>> >
>>>> > Or we can directly overwirte the system class loader by calling the
>>>> > protected method `addURL` which will not work and throw exception if the
>>>> > code is wrapped in security manager.
>>>> >
>>>> >
>>>> > Sincerely,
>>>> >
>>>> > DB Tsai
>>>> > -------------------------------------------------------
>>>> > My Blog: https://www.dbtsai.com
>>>> > LinkedIn: https://www.linkedin.com/in/dbtsai
>>>> >
>>>> >
>>>> > On Wed, May 21, 2014 at 1:13 PM, Sandy Ryza <sa...@cloudera.com>
>>>> wrote:
>>>> >
>>>> >> This will solve the issue for jars added upon application submission,
>>>> but,
>>>> >> on top of this, we need to make sure that anything dynamically added
>>>> >> through sc.addJar works as well.
>>>> >>
>>>> >> To do so, we need to make sure that any jars retrieved via the driver's
>>>> >> HTTP server are loaded by the same classloader that loads the jars
>>>> given on
>>>> >> app submission.  To achieve this, we need to either use the same
>>>> >> classloader for both system jars and user jars, or make sure that the
>>>> user
>>>> >> jars given on app submission are under the same classloader used for
>>>> >> dynamically added jars.
>>>> >>
>>>> >> On Tue, May 20, 2014 at 5:59 PM, Xiangrui Meng <me...@gmail.com>
>>>> wrote:
>>>> >>
>>>> >> > Talked with Sandy and DB offline. I think the best solution is sending
>>>> >> > the secondary jars to the distributed cache of all containers rather
>>>> >> > than just the master, and set the classpath to include spark jar,
>>>> >> > primary app jar, and secondary jars before executor starts. In this
>>>> >> > way, user only needs to specify secondary jars via --jars instead of
>>>> >> > calling sc.addJar inside the code. It also solves the scalability
>>>> >> > problem of serving all the jars via http.
>>>> >> >
>>>> >> > If this solution sounds good, I can try to make a patch.
>>>> >> >
>>>> >> > Best,
>>>> >> > Xiangrui
>>>> >> >
>>>> >> > On Mon, May 19, 2014 at 10:04 PM, DB Tsai <db...@stanford.edu>
>>>> wrote:
>>>> >> > > In 1.0, there is a new option for users to choose which classloader
>>>> has
>>>> >> > > higher priority via spark.files.userClassPathFirst, I decided to
>>>> submit
>>>> >> > the
>>>> >> > > PR for 0.9 first. We use this patch in our lab and we can use those
>>>> >> jars
>>>> >> > > added by sc.addJar without reflection.
>>>> >> > >
>>>> >> > > https://github.com/apache/spark/pull/834
>>>> >> > >
>>>> >> > > Can anyone comment if it's a good approach?
>>>> >> > >
>>>> >> > > Thanks.
>>>> >> > >
>>>> >> > >
>>>> >> > > Sincerely,
>>>> >> > >
>>>> >> > > DB Tsai
>>>> >> > > -------------------------------------------------------
>>>> >> > > My Blog: https://www.dbtsai.com
>>>> >> > > LinkedIn: https://www.linkedin.com/in/dbtsai
>>>> >> > >
>>>> >> > >
>>>> >> > > On Mon, May 19, 2014 at 7:42 PM, DB Tsai <db...@stanford.edu>
>>>> wrote:
>>>> >> > >
>>>> >> > >> Good summary! We fixed it in branch 0.9 since our production is
>>>> still
>>>> >> in
>>>> >> > >> 0.9. I'm porting it to 1.0 now, and hopefully will submit PR for
>>>> 1.0
>>>> >> > >> tonight.
>>>> >> > >>
>>>> >> > >>
>>>> >> > >> Sincerely,
>>>> >> > >>
>>>> >> > >> DB Tsai
>>>> >> > >> -------------------------------------------------------
>>>> >> > >> My Blog: https://www.dbtsai.com
>>>> >> > >> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>> >> > >>
>>>> >> > >>
>>>> >> > >> On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza <
>>>> sandy.ryza@cloudera.com
>>>> >> > >wrote:
>>>> >> > >>
>>>> >> > >>> It just hit me why this problem is showing up on YARN and not on
>>>> >> > >>> standalone.
>>>> >> > >>>
>>>> >> > >>> The relevant difference between YARN and standalone is that, on
>>>> YARN,
>>>> >> > the
>>>> >> > >>> app jar is loaded by the system classloader instead of Spark's
>>>> custom
>>>> >> > URL
>>>> >> > >>> classloader.
>>>> >> > >>>
>>>> >> > >>> On YARN, the system classloader knows about [the classes in the
>>>> spark
>>>> >> > >>> jars,
>>>> >> > >>> the classes in the primary app jar].   The custom classloader
>>>> knows
>>>> >> > about
>>>> >> > >>> [the classes in secondary app jars] and has the system
>>>> classloader as
>>>> >> > its
>>>> >> > >>> parent.
>>>> >> > >>>
>>>> >> > >>> A few relevant facts (mostly redundant with what Sean pointed
>>>> out):
>>>> >> > >>> * Every class has a classloader that loaded it.
>>>> >> > >>> * When an object of class B is instantiated inside of class A, the
>>>> >> > >>> classloader used for loading B is the classloader that was used
>>>> for
>>>> >> > >>> loading
>>>> >> > >>> A.
>>>> >> > >>> * When a classloader fails to load a class, it lets its parent
>>>> >> > classloader
>>>> >> > >>> try.  If its parent succeeds, its parent becomes the "classloader
>>>> >> that
>>>> >> > >>> loaded it".
>>>> >> > >>>
>>>> >> > >>> So suppose class B is in a secondary app jar and class A is in the
>>>> >> > primary
>>>> >> > >>> app jar:
>>>> >> > >>> 1. The custom classloader will try to load class A.
>>>> >> > >>> 2. It will fail, because it only knows about the secondary jars.
>>>> >> > >>> 3. It will delegate to its parent, the system classloader.
>>>> >> > >>> 4. The system classloader will succeed, because it knows about the
>>>> >> > primary
>>>> >> > >>> app jar.
>>>> >> > >>> 5. A's classloader will be the system classloader.
>>>> >> > >>> 6. A tries to instantiate an instance of class B.
>>>> >> > >>> 7. B will be loaded with A's classloader, which is the system
>>>> >> > classloader.
>>>> >> > >>> 8. Loading B will fail, because A's classloader, which is the
>>>> system
>>>> >> > >>> classloader, doesn't know about the secondary app jars.
>>>> >> > >>>
>>>> >> > >>> In Spark standalone, A and B are both loaded by the custom
>>>> >> > classloader, so
>>>> >> > >>> this issue doesn't come up.
>>>> >> > >>>
>>>> >> > >>> -Sandy
>>>> >> > >>>
>>>> >> > >>> On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell <
>>>> pwendell@gmail.com
>>>> >> >
>>>> >> > >>> wrote:
>>>> >> > >>>
>>>> >> > >>> > Having a user add define a custom class inside of an added jar
>>>> and
>>>> >> > >>> > instantiate it directly inside of an executor is definitely
>>>> >> supported
>>>> >> > >>> > in Spark and has been for a really long time (several years).
>>>> This
>>>> >> is
>>>> >> > >>> > something we do all the time in Spark.
>>>> >> > >>> >
>>>> >> > >>> > DB - I'd hold off on a re-architecting of this until we identify
>>>> >> > >>> > exactly what is causing the bug you are running into.
>>>> >> > >>> >
>>>> >> > >>> > In a nutshell, when the bytecode "new Foo()" is run on the
>>>> >> executor,
>>>> >> > >>> > it will ask the driver for the class over HTTP using a custom
>>>> >> > >>> > classloader. Something in that pipeline is breaking here,
>>>> possibly
>>>> >> > >>> > related to the YARN deployment stuff.
>>>> >> > >>> >
>>>> >> > >>> >
>>>> >> > >>> > On Mon, May 19, 2014 at 12:29 AM, Sean Owen <sowen@cloudera.com
>>>> >
>>>> >> > wrote:
>>>> >> > >>> > > I don't think a customer classloader is necessary.
>>>> >> > >>> > >
>>>> >> > >>> > > Well, it occurs to me that this is no new problem. Hadoop,
>>>> >> Tomcat,
>>>> >> > etc
>>>> >> > >>> > > all run custom user code that creates new user objects without
>>>> >> > >>> > > reflection. I should go see how that's done. Maybe it's
>>>> totally
>>>> >> > valid
>>>> >> > >>> > > to set the thread's context classloader for just this purpose,
>>>> >> and
>>>> >> > I
>>>> >> > >>> > > am not thinking clearly.
>>>> >> > >>> > >
>>>> >> > >>> > > On Mon, May 19, 2014 at 8:26 AM, Andrew Ash <
>>>> >> andrew@andrewash.com>
>>>> >> > >>> > wrote:
>>>> >> > >>> > >> Sounds like the problem is that classloaders always look in
>>>> >> their
>>>> >> > >>> > parents
>>>> >> > >>> > >> before themselves, and Spark users want executors to pick up
>>>> >> > classes
>>>> >> > >>> > from
>>>> >> > >>> > >> their custom code before the ones in Spark plus its
>>>> >> dependencies.
>>>> >> > >>> > >>
>>>> >> > >>> > >> Would a custom classloader that delegates to the parent after
>>>> >> > first
>>>> >> > >>> > >> checking itself fix this up?
>>>> >> > >>> > >>
>>>> >> > >>> > >>
>>>> >> > >>> > >> On Mon, May 19, 2014 at 12:17 AM, DB Tsai <
>>>> dbtsai@stanford.edu>
>>>> >> > >>> wrote:
>>>> >> > >>> > >>
>>>> >> > >>> > >>> Hi Sean,
>>>> >> > >>> > >>>
>>>> >> > >>> > >>> It's true that the issue here is classloader, and due to the
>>>> >> > >>> > classloader
>>>> >> > >>> > >>> delegation model, users have to use reflection in the
>>>> executors
>>>> >> > to
>>>> >> > >>> > pick up
>>>> >> > >>> > >>> the classloader in order to use those classes added by
>>>> >> sc.addJars
>>>> >> > >>> APIs.
>>>> >> > >>> > >>> However, it's very inconvenience for users, and not
>>>> documented
>>>> >> in
>>>> >> > >>> > spark.
>>>> >> > >>> > >>>
>>>> >> > >>> > >>> I'm working on a patch to solve it by calling the protected
>>>> >> > method
>>>> >> > >>> > addURL
>>>> >> > >>> > >>> in URLClassLoader to update the current default
>>>> classloader, so
>>>> >> > no
>>>> >> > >>> > >>> customClassLoader anymore. I wonder if this is an good way
>>>> to
>>>> >> go.
>>>> >> > >>> > >>>
>>>> >> > >>> > >>>   private def addURL(url: URL, loader: URLClassLoader){
>>>> >> > >>> > >>>     try {
>>>> >> > >>> > >>>       val method: Method =
>>>> >> > >>> > >>> classOf[URLClassLoader].getDeclaredMethod("addURL",
>>>> >> classOf[URL])
>>>> >> > >>> > >>>       method.setAccessible(true)
>>>> >> > >>> > >>>       method.invoke(loader, url)
>>>> >> > >>> > >>>     }
>>>> >> > >>> > >>>     catch {
>>>> >> > >>> > >>>       case t: Throwable => {
>>>> >> > >>> > >>>         throw new IOException("Error, could not add URL to
>>>> >> system
>>>> >> > >>> > >>> classloader")
>>>> >> > >>> > >>>       }
>>>> >> > >>> > >>>     }
>>>> >> > >>> > >>>   }
>>>> >> > >>> > >>>
>>>> >> > >>> > >>>
>>>> >> > >>> > >>>
>>>> >> > >>> > >>> Sincerely,
>>>> >> > >>> > >>>
>>>> >> > >>> > >>> DB Tsai
>>>> >> > >>> > >>> -------------------------------------------------------
>>>> >> > >>> > >>> My Blog: https://www.dbtsai.com
>>>> >> > >>> > >>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>> >> > >>> > >>>
>>>> >> > >>> > >>>
>>>> >> > >>> > >>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen <
>>>> >> sowen@cloudera.com>
>>>> >> > >>> > wrote:
>>>> >> > >>> > >>>
>>>> >> > >>> > >>> > I might be stating the obvious for everyone, but the issue
>>>> >> > here is
>>>> >> > >>> > not
>>>> >> > >>> > >>> > reflection or the source of the JAR, but the ClassLoader.
>>>> The
>>>> >> > >>> basic
>>>> >> > >>> > >>> > rules are this.
>>>> >> > >>> > >>> >
>>>> >> > >>> > >>> > "new Foo" will use the ClassLoader that defines Foo. This
>>>> is
>>>> >> > >>> usually
>>>> >> > >>> > >>> > the ClassLoader that loaded whatever it is that first
>>>> >> > referenced
>>>> >> > >>> Foo
>>>> >> > >>> > >>> > and caused it to be loaded -- usually the ClassLoader
>>>> holding
>>>> >> > your
>>>> >> > >>> > >>> > other app classes.
>>>> >> > >>> > >>> >
>>>> >> > >>> > >>> > ClassLoaders can have a parent-child relationship.
>>>> >> ClassLoaders
>>>> >> > >>> > always
>>>> >> > >>> > >>> > look in their parent before themselves.
>>>> >> > >>> > >>> >
>>>> >> > >>> > >>> > (Careful then -- in contexts like Hadoop or Tomcat where
>>>> your
>>>> >> > app
>>>> >> > >>> is
>>>> >> > >>> > >>> > loaded in a child ClassLoader, and you reference a class
>>>> that
>>>> >> > >>> Hadoop
>>>> >> > >>> > >>> > or Tomcat also has (like a lib class) you will get the
>>>> >> > container's
>>>> >> > >>> > >>> > version!)
>>>> >> > >>> > >>> >
>>>> >> > >>> > >>> > When you load an external JAR it has a separate
>>>> ClassLoader
>>>> >> > which
>>>> >> > >>> > does
>>>> >> > >>> > >>> > not necessarily bear any relation to the one containing
>>>> your
>>>> >> > app
>>>> >> > >>> > >>> > classes, so yeah it is not generally going to make "new
>>>> Foo"
>>>> >> > work.
>>>> >> > >>> > >>> >
>>>> >> > >>> > >>> > Reflection lets you pick the ClassLoader, yes.
>>>> >> > >>> > >>> >
>>>> >> > >>> > >>> > I would not call setContextClassLoader.
>>>> >> > >>> > >>> >
>>>> >> > >>> > >>> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza <
>>>> >> > >>> > sandy.ryza@cloudera.com>
>>>> >> > >>> > >>> > wrote:
>>>> >> > >>> > >>> > > I spoke with DB offline about this a little while ago
>>>> and
>>>> >> he
>>>> >> > >>> > confirmed
>>>> >> > >>> > >>> > that
>>>> >> > >>> > >>> > > he was able to access the jar from the driver.
>>>> >> > >>> > >>> > >
>>>> >> > >>> > >>> > > The issue appears to be a general Java issue: you can't
>>>> >> > directly
>>>> >> > >>> > >>> > > instantiate a class from a dynamically loaded jar.
>>>> >> > >>> > >>> > >
>>>> >> > >>> > >>> > > I reproduced it locally outside of Spark with:
>>>> >> > >>> > >>> > > ---
>>>> >> > >>> > >>> > >     URLClassLoader urlClassLoader = new
>>>> URLClassLoader(new
>>>> >> > >>> URL[] {
>>>> >> > >>> > new
>>>> >> > >>> > >>> > > File("myotherjar.jar").toURI().toURL() }, null);
>>>> >> > >>> > >>> > >
>>>> >> > >>> Thread.currentThread().setContextClassLoader(urlClassLoader);
>>>> >> > >>> > >>> > >     MyClassFromMyOtherJar obj = new
>>>> >> MyClassFromMyOtherJar();
>>>> >> > >>> > >>> > > ---
>>>> >> > >>> > >>> > >
>>>> >> > >>> > >>> > > I was able to load the class with reflection.
>>>> >> > >>> > >>> >
>>>> >> > >>> > >>>
>>>> >> > >>> >
>>>> >> > >>>
>>>> >> > >>
>>>> >> > >>
>>>> >> >
>>>> >>
>>>>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by Patrick Wendell <pw...@gmail.com>.
Of these two solutions I'd definitely prefer 2 in the short term. I'd
imagine the fix is very straightforward (it would mostly just be
remove code), and we'd be making this more consistent with the
standalone mode which makes things way easier to reason about.

In the long term we'll definitely want to exploit the distributed
cache more, but at this point it's premature optimization at a high
complexity cost. Writing stuff to HDFS through is so slow anyways I'd
guess that serving it directly from the driver is still faster in most
cases (though for very large jar sizes or very large clusters, yes,
we'll need the distributed cache).

- Patrick

On Wed, May 21, 2014 at 2:41 PM, Xiangrui Meng <me...@gmail.com> wrote:
> That's a good example. If we really want to cover that case, there are
> two solutions:
>
> 1. Follow DB's patch, adding jars to the system classloader. Then we
> cannot put a user class in front of an existing class.
> 2. Do not send the primary jar and secondary jars to executors'
> distributed cache. Instead, add them to "spark.jars" in SparkSubmit
> and serve them via http by called sc.addJar in SparkContext.
>
> What is your preference?
>
> On Wed, May 21, 2014 at 2:27 PM, Sandy Ryza <sa...@cloudera.com> wrote:
>> Is that an assumption we can make?  I think we'd run into an issue in this
>> situation:
>>
>> *In primary jar:*
>> def makeDynamicObject(clazz: String) = Class.forName(clazz).newInstance()
>>
>> *In app code:*
>> sc.addJar("dynamicjar.jar")
>> ...
>> rdd.map(x => makeDynamicObject("some.class.from.DynamicJar"))
>>
>> It might be fair to say that the user should make sure to use the context
>> classloader when instantiating dynamic classes, but I think it's weird that
>> this code would work on Spark standalone but not on YARN.
>>
>> -Sandy
>>
>>
>> On Wed, May 21, 2014 at 2:10 PM, Xiangrui Meng <me...@gmail.com> wrote:
>>
>>> I think adding jars dynamically should work as long as the primary jar
>>> and the secondary jars do not depend on dynamically added jars, which
>>> should be the correct logic. -Xiangrui
>>>
>>> On Wed, May 21, 2014 at 1:40 PM, DB Tsai <db...@stanford.edu> wrote:
>>> > This will be another separate story.
>>> >
>>> > Since in the yarn deployment, as Sandy said, the app.jar will be always
>>> in
>>> > the systemclassloader which means any object instantiated in app.jar will
>>> > have parent loader of systemclassloader instead of custom one. As a
>>> result,
>>> > the custom class loader in yarn will never work without specifically
>>> using
>>> > reflection.
>>> >
>>> > Solution will be not using system classloader in the classloader
>>> hierarchy,
>>> > and add all the resources in system one into custom one. This is the
>>> > approach of tomcat takes.
>>> >
>>> > Or we can directly overwirte the system class loader by calling the
>>> > protected method `addURL` which will not work and throw exception if the
>>> > code is wrapped in security manager.
>>> >
>>> >
>>> > Sincerely,
>>> >
>>> > DB Tsai
>>> > -------------------------------------------------------
>>> > My Blog: https://www.dbtsai.com
>>> > LinkedIn: https://www.linkedin.com/in/dbtsai
>>> >
>>> >
>>> > On Wed, May 21, 2014 at 1:13 PM, Sandy Ryza <sa...@cloudera.com>
>>> wrote:
>>> >
>>> >> This will solve the issue for jars added upon application submission,
>>> but,
>>> >> on top of this, we need to make sure that anything dynamically added
>>> >> through sc.addJar works as well.
>>> >>
>>> >> To do so, we need to make sure that any jars retrieved via the driver's
>>> >> HTTP server are loaded by the same classloader that loads the jars
>>> given on
>>> >> app submission.  To achieve this, we need to either use the same
>>> >> classloader for both system jars and user jars, or make sure that the
>>> user
>>> >> jars given on app submission are under the same classloader used for
>>> >> dynamically added jars.
>>> >>
>>> >> On Tue, May 20, 2014 at 5:59 PM, Xiangrui Meng <me...@gmail.com>
>>> wrote:
>>> >>
>>> >> > Talked with Sandy and DB offline. I think the best solution is sending
>>> >> > the secondary jars to the distributed cache of all containers rather
>>> >> > than just the master, and set the classpath to include spark jar,
>>> >> > primary app jar, and secondary jars before executor starts. In this
>>> >> > way, user only needs to specify secondary jars via --jars instead of
>>> >> > calling sc.addJar inside the code. It also solves the scalability
>>> >> > problem of serving all the jars via http.
>>> >> >
>>> >> > If this solution sounds good, I can try to make a patch.
>>> >> >
>>> >> > Best,
>>> >> > Xiangrui
>>> >> >
>>> >> > On Mon, May 19, 2014 at 10:04 PM, DB Tsai <db...@stanford.edu>
>>> wrote:
>>> >> > > In 1.0, there is a new option for users to choose which classloader
>>> has
>>> >> > > higher priority via spark.files.userClassPathFirst, I decided to
>>> submit
>>> >> > the
>>> >> > > PR for 0.9 first. We use this patch in our lab and we can use those
>>> >> jars
>>> >> > > added by sc.addJar without reflection.
>>> >> > >
>>> >> > > https://github.com/apache/spark/pull/834
>>> >> > >
>>> >> > > Can anyone comment if it's a good approach?
>>> >> > >
>>> >> > > Thanks.
>>> >> > >
>>> >> > >
>>> >> > > Sincerely,
>>> >> > >
>>> >> > > DB Tsai
>>> >> > > -------------------------------------------------------
>>> >> > > My Blog: https://www.dbtsai.com
>>> >> > > LinkedIn: https://www.linkedin.com/in/dbtsai
>>> >> > >
>>> >> > >
>>> >> > > On Mon, May 19, 2014 at 7:42 PM, DB Tsai <db...@stanford.edu>
>>> wrote:
>>> >> > >
>>> >> > >> Good summary! We fixed it in branch 0.9 since our production is
>>> still
>>> >> in
>>> >> > >> 0.9. I'm porting it to 1.0 now, and hopefully will submit PR for
>>> 1.0
>>> >> > >> tonight.
>>> >> > >>
>>> >> > >>
>>> >> > >> Sincerely,
>>> >> > >>
>>> >> > >> DB Tsai
>>> >> > >> -------------------------------------------------------
>>> >> > >> My Blog: https://www.dbtsai.com
>>> >> > >> LinkedIn: https://www.linkedin.com/in/dbtsai
>>> >> > >>
>>> >> > >>
>>> >> > >> On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza <
>>> sandy.ryza@cloudera.com
>>> >> > >wrote:
>>> >> > >>
>>> >> > >>> It just hit me why this problem is showing up on YARN and not on
>>> >> > >>> standalone.
>>> >> > >>>
>>> >> > >>> The relevant difference between YARN and standalone is that, on
>>> YARN,
>>> >> > the
>>> >> > >>> app jar is loaded by the system classloader instead of Spark's
>>> custom
>>> >> > URL
>>> >> > >>> classloader.
>>> >> > >>>
>>> >> > >>> On YARN, the system classloader knows about [the classes in the
>>> spark
>>> >> > >>> jars,
>>> >> > >>> the classes in the primary app jar].   The custom classloader
>>> knows
>>> >> > about
>>> >> > >>> [the classes in secondary app jars] and has the system
>>> classloader as
>>> >> > its
>>> >> > >>> parent.
>>> >> > >>>
>>> >> > >>> A few relevant facts (mostly redundant with what Sean pointed
>>> out):
>>> >> > >>> * Every class has a classloader that loaded it.
>>> >> > >>> * When an object of class B is instantiated inside of class A, the
>>> >> > >>> classloader used for loading B is the classloader that was used
>>> for
>>> >> > >>> loading
>>> >> > >>> A.
>>> >> > >>> * When a classloader fails to load a class, it lets its parent
>>> >> > classloader
>>> >> > >>> try.  If its parent succeeds, its parent becomes the "classloader
>>> >> that
>>> >> > >>> loaded it".
>>> >> > >>>
>>> >> > >>> So suppose class B is in a secondary app jar and class A is in the
>>> >> > primary
>>> >> > >>> app jar:
>>> >> > >>> 1. The custom classloader will try to load class A.
>>> >> > >>> 2. It will fail, because it only knows about the secondary jars.
>>> >> > >>> 3. It will delegate to its parent, the system classloader.
>>> >> > >>> 4. The system classloader will succeed, because it knows about the
>>> >> > primary
>>> >> > >>> app jar.
>>> >> > >>> 5. A's classloader will be the system classloader.
>>> >> > >>> 6. A tries to instantiate an instance of class B.
>>> >> > >>> 7. B will be loaded with A's classloader, which is the system
>>> >> > classloader.
>>> >> > >>> 8. Loading B will fail, because A's classloader, which is the
>>> system
>>> >> > >>> classloader, doesn't know about the secondary app jars.
>>> >> > >>>
>>> >> > >>> In Spark standalone, A and B are both loaded by the custom
>>> >> > classloader, so
>>> >> > >>> this issue doesn't come up.
>>> >> > >>>
>>> >> > >>> -Sandy
>>> >> > >>>
>>> >> > >>> On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell <
>>> pwendell@gmail.com
>>> >> >
>>> >> > >>> wrote:
>>> >> > >>>
>>> >> > >>> > Having a user add define a custom class inside of an added jar
>>> and
>>> >> > >>> > instantiate it directly inside of an executor is definitely
>>> >> supported
>>> >> > >>> > in Spark and has been for a really long time (several years).
>>> This
>>> >> is
>>> >> > >>> > something we do all the time in Spark.
>>> >> > >>> >
>>> >> > >>> > DB - I'd hold off on a re-architecting of this until we identify
>>> >> > >>> > exactly what is causing the bug you are running into.
>>> >> > >>> >
>>> >> > >>> > In a nutshell, when the bytecode "new Foo()" is run on the
>>> >> executor,
>>> >> > >>> > it will ask the driver for the class over HTTP using a custom
>>> >> > >>> > classloader. Something in that pipeline is breaking here,
>>> possibly
>>> >> > >>> > related to the YARN deployment stuff.
>>> >> > >>> >
>>> >> > >>> >
>>> >> > >>> > On Mon, May 19, 2014 at 12:29 AM, Sean Owen <sowen@cloudera.com
>>> >
>>> >> > wrote:
>>> >> > >>> > > I don't think a customer classloader is necessary.
>>> >> > >>> > >
>>> >> > >>> > > Well, it occurs to me that this is no new problem. Hadoop,
>>> >> Tomcat,
>>> >> > etc
>>> >> > >>> > > all run custom user code that creates new user objects without
>>> >> > >>> > > reflection. I should go see how that's done. Maybe it's
>>> totally
>>> >> > valid
>>> >> > >>> > > to set the thread's context classloader for just this purpose,
>>> >> and
>>> >> > I
>>> >> > >>> > > am not thinking clearly.
>>> >> > >>> > >
>>> >> > >>> > > On Mon, May 19, 2014 at 8:26 AM, Andrew Ash <
>>> >> andrew@andrewash.com>
>>> >> > >>> > wrote:
>>> >> > >>> > >> Sounds like the problem is that classloaders always look in
>>> >> their
>>> >> > >>> > parents
>>> >> > >>> > >> before themselves, and Spark users want executors to pick up
>>> >> > classes
>>> >> > >>> > from
>>> >> > >>> > >> their custom code before the ones in Spark plus its
>>> >> dependencies.
>>> >> > >>> > >>
>>> >> > >>> > >> Would a custom classloader that delegates to the parent after
>>> >> > first
>>> >> > >>> > >> checking itself fix this up?
>>> >> > >>> > >>
>>> >> > >>> > >>
>>> >> > >>> > >> On Mon, May 19, 2014 at 12:17 AM, DB Tsai <
>>> dbtsai@stanford.edu>
>>> >> > >>> wrote:
>>> >> > >>> > >>
>>> >> > >>> > >>> Hi Sean,
>>> >> > >>> > >>>
>>> >> > >>> > >>> It's true that the issue here is classloader, and due to the
>>> >> > >>> > classloader
>>> >> > >>> > >>> delegation model, users have to use reflection in the
>>> executors
>>> >> > to
>>> >> > >>> > pick up
>>> >> > >>> > >>> the classloader in order to use those classes added by
>>> >> sc.addJars
>>> >> > >>> APIs.
>>> >> > >>> > >>> However, it's very inconvenience for users, and not
>>> documented
>>> >> in
>>> >> > >>> > spark.
>>> >> > >>> > >>>
>>> >> > >>> > >>> I'm working on a patch to solve it by calling the protected
>>> >> > method
>>> >> > >>> > addURL
>>> >> > >>> > >>> in URLClassLoader to update the current default
>>> classloader, so
>>> >> > no
>>> >> > >>> > >>> customClassLoader anymore. I wonder if this is an good way
>>> to
>>> >> go.
>>> >> > >>> > >>>
>>> >> > >>> > >>>   private def addURL(url: URL, loader: URLClassLoader){
>>> >> > >>> > >>>     try {
>>> >> > >>> > >>>       val method: Method =
>>> >> > >>> > >>> classOf[URLClassLoader].getDeclaredMethod("addURL",
>>> >> classOf[URL])
>>> >> > >>> > >>>       method.setAccessible(true)
>>> >> > >>> > >>>       method.invoke(loader, url)
>>> >> > >>> > >>>     }
>>> >> > >>> > >>>     catch {
>>> >> > >>> > >>>       case t: Throwable => {
>>> >> > >>> > >>>         throw new IOException("Error, could not add URL to
>>> >> system
>>> >> > >>> > >>> classloader")
>>> >> > >>> > >>>       }
>>> >> > >>> > >>>     }
>>> >> > >>> > >>>   }
>>> >> > >>> > >>>
>>> >> > >>> > >>>
>>> >> > >>> > >>>
>>> >> > >>> > >>> Sincerely,
>>> >> > >>> > >>>
>>> >> > >>> > >>> DB Tsai
>>> >> > >>> > >>> -------------------------------------------------------
>>> >> > >>> > >>> My Blog: https://www.dbtsai.com
>>> >> > >>> > >>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>> >> > >>> > >>>
>>> >> > >>> > >>>
>>> >> > >>> > >>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen <
>>> >> sowen@cloudera.com>
>>> >> > >>> > wrote:
>>> >> > >>> > >>>
>>> >> > >>> > >>> > I might be stating the obvious for everyone, but the issue
>>> >> > here is
>>> >> > >>> > not
>>> >> > >>> > >>> > reflection or the source of the JAR, but the ClassLoader.
>>> The
>>> >> > >>> basic
>>> >> > >>> > >>> > rules are this.
>>> >> > >>> > >>> >
>>> >> > >>> > >>> > "new Foo" will use the ClassLoader that defines Foo. This
>>> is
>>> >> > >>> usually
>>> >> > >>> > >>> > the ClassLoader that loaded whatever it is that first
>>> >> > referenced
>>> >> > >>> Foo
>>> >> > >>> > >>> > and caused it to be loaded -- usually the ClassLoader
>>> holding
>>> >> > your
>>> >> > >>> > >>> > other app classes.
>>> >> > >>> > >>> >
>>> >> > >>> > >>> > ClassLoaders can have a parent-child relationship.
>>> >> ClassLoaders
>>> >> > >>> > always
>>> >> > >>> > >>> > look in their parent before themselves.
>>> >> > >>> > >>> >
>>> >> > >>> > >>> > (Careful then -- in contexts like Hadoop or Tomcat where
>>> your
>>> >> > app
>>> >> > >>> is
>>> >> > >>> > >>> > loaded in a child ClassLoader, and you reference a class
>>> that
>>> >> > >>> Hadoop
>>> >> > >>> > >>> > or Tomcat also has (like a lib class) you will get the
>>> >> > container's
>>> >> > >>> > >>> > version!)
>>> >> > >>> > >>> >
>>> >> > >>> > >>> > When you load an external JAR it has a separate
>>> ClassLoader
>>> >> > which
>>> >> > >>> > does
>>> >> > >>> > >>> > not necessarily bear any relation to the one containing
>>> your
>>> >> > app
>>> >> > >>> > >>> > classes, so yeah it is not generally going to make "new
>>> Foo"
>>> >> > work.
>>> >> > >>> > >>> >
>>> >> > >>> > >>> > Reflection lets you pick the ClassLoader, yes.
>>> >> > >>> > >>> >
>>> >> > >>> > >>> > I would not call setContextClassLoader.
>>> >> > >>> > >>> >
>>> >> > >>> > >>> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza <
>>> >> > >>> > sandy.ryza@cloudera.com>
>>> >> > >>> > >>> > wrote:
>>> >> > >>> > >>> > > I spoke with DB offline about this a little while ago
>>> and
>>> >> he
>>> >> > >>> > confirmed
>>> >> > >>> > >>> > that
>>> >> > >>> > >>> > > he was able to access the jar from the driver.
>>> >> > >>> > >>> > >
>>> >> > >>> > >>> > > The issue appears to be a general Java issue: you can't
>>> >> > directly
>>> >> > >>> > >>> > > instantiate a class from a dynamically loaded jar.
>>> >> > >>> > >>> > >
>>> >> > >>> > >>> > > I reproduced it locally outside of Spark with:
>>> >> > >>> > >>> > > ---
>>> >> > >>> > >>> > >     URLClassLoader urlClassLoader = new
>>> URLClassLoader(new
>>> >> > >>> URL[] {
>>> >> > >>> > new
>>> >> > >>> > >>> > > File("myotherjar.jar").toURI().toURL() }, null);
>>> >> > >>> > >>> > >
>>> >> > >>> Thread.currentThread().setContextClassLoader(urlClassLoader);
>>> >> > >>> > >>> > >     MyClassFromMyOtherJar obj = new
>>> >> MyClassFromMyOtherJar();
>>> >> > >>> > >>> > > ---
>>> >> > >>> > >>> > >
>>> >> > >>> > >>> > > I was able to load the class with reflection.
>>> >> > >>> > >>> >
>>> >> > >>> > >>>
>>> >> > >>> >
>>> >> > >>>
>>> >> > >>
>>> >> > >>
>>> >> >
>>> >>
>>>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by Xiangrui Meng <me...@gmail.com>.
That's a good example. If we really want to cover that case, there are
two solutions:

1. Follow DB's patch, adding jars to the system classloader. Then we
cannot put a user class in front of an existing class.
2. Do not send the primary jar and secondary jars to executors'
distributed cache. Instead, add them to "spark.jars" in SparkSubmit
and serve them via http by called sc.addJar in SparkContext.

What is your preference?

On Wed, May 21, 2014 at 2:27 PM, Sandy Ryza <sa...@cloudera.com> wrote:
> Is that an assumption we can make?  I think we'd run into an issue in this
> situation:
>
> *In primary jar:*
> def makeDynamicObject(clazz: String) = Class.forName(clazz).newInstance()
>
> *In app code:*
> sc.addJar("dynamicjar.jar")
> ...
> rdd.map(x => makeDynamicObject("some.class.from.DynamicJar"))
>
> It might be fair to say that the user should make sure to use the context
> classloader when instantiating dynamic classes, but I think it's weird that
> this code would work on Spark standalone but not on YARN.
>
> -Sandy
>
>
> On Wed, May 21, 2014 at 2:10 PM, Xiangrui Meng <me...@gmail.com> wrote:
>
>> I think adding jars dynamically should work as long as the primary jar
>> and the secondary jars do not depend on dynamically added jars, which
>> should be the correct logic. -Xiangrui
>>
>> On Wed, May 21, 2014 at 1:40 PM, DB Tsai <db...@stanford.edu> wrote:
>> > This will be another separate story.
>> >
>> > Since in the yarn deployment, as Sandy said, the app.jar will be always
>> in
>> > the systemclassloader which means any object instantiated in app.jar will
>> > have parent loader of systemclassloader instead of custom one. As a
>> result,
>> > the custom class loader in yarn will never work without specifically
>> using
>> > reflection.
>> >
>> > Solution will be not using system classloader in the classloader
>> hierarchy,
>> > and add all the resources in system one into custom one. This is the
>> > approach of tomcat takes.
>> >
>> > Or we can directly overwirte the system class loader by calling the
>> > protected method `addURL` which will not work and throw exception if the
>> > code is wrapped in security manager.
>> >
>> >
>> > Sincerely,
>> >
>> > DB Tsai
>> > -------------------------------------------------------
>> > My Blog: https://www.dbtsai.com
>> > LinkedIn: https://www.linkedin.com/in/dbtsai
>> >
>> >
>> > On Wed, May 21, 2014 at 1:13 PM, Sandy Ryza <sa...@cloudera.com>
>> wrote:
>> >
>> >> This will solve the issue for jars added upon application submission,
>> but,
>> >> on top of this, we need to make sure that anything dynamically added
>> >> through sc.addJar works as well.
>> >>
>> >> To do so, we need to make sure that any jars retrieved via the driver's
>> >> HTTP server are loaded by the same classloader that loads the jars
>> given on
>> >> app submission.  To achieve this, we need to either use the same
>> >> classloader for both system jars and user jars, or make sure that the
>> user
>> >> jars given on app submission are under the same classloader used for
>> >> dynamically added jars.
>> >>
>> >> On Tue, May 20, 2014 at 5:59 PM, Xiangrui Meng <me...@gmail.com>
>> wrote:
>> >>
>> >> > Talked with Sandy and DB offline. I think the best solution is sending
>> >> > the secondary jars to the distributed cache of all containers rather
>> >> > than just the master, and set the classpath to include spark jar,
>> >> > primary app jar, and secondary jars before executor starts. In this
>> >> > way, user only needs to specify secondary jars via --jars instead of
>> >> > calling sc.addJar inside the code. It also solves the scalability
>> >> > problem of serving all the jars via http.
>> >> >
>> >> > If this solution sounds good, I can try to make a patch.
>> >> >
>> >> > Best,
>> >> > Xiangrui
>> >> >
>> >> > On Mon, May 19, 2014 at 10:04 PM, DB Tsai <db...@stanford.edu>
>> wrote:
>> >> > > In 1.0, there is a new option for users to choose which classloader
>> has
>> >> > > higher priority via spark.files.userClassPathFirst, I decided to
>> submit
>> >> > the
>> >> > > PR for 0.9 first. We use this patch in our lab and we can use those
>> >> jars
>> >> > > added by sc.addJar without reflection.
>> >> > >
>> >> > > https://github.com/apache/spark/pull/834
>> >> > >
>> >> > > Can anyone comment if it's a good approach?
>> >> > >
>> >> > > Thanks.
>> >> > >
>> >> > >
>> >> > > Sincerely,
>> >> > >
>> >> > > DB Tsai
>> >> > > -------------------------------------------------------
>> >> > > My Blog: https://www.dbtsai.com
>> >> > > LinkedIn: https://www.linkedin.com/in/dbtsai
>> >> > >
>> >> > >
>> >> > > On Mon, May 19, 2014 at 7:42 PM, DB Tsai <db...@stanford.edu>
>> wrote:
>> >> > >
>> >> > >> Good summary! We fixed it in branch 0.9 since our production is
>> still
>> >> in
>> >> > >> 0.9. I'm porting it to 1.0 now, and hopefully will submit PR for
>> 1.0
>> >> > >> tonight.
>> >> > >>
>> >> > >>
>> >> > >> Sincerely,
>> >> > >>
>> >> > >> DB Tsai
>> >> > >> -------------------------------------------------------
>> >> > >> My Blog: https://www.dbtsai.com
>> >> > >> LinkedIn: https://www.linkedin.com/in/dbtsai
>> >> > >>
>> >> > >>
>> >> > >> On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza <
>> sandy.ryza@cloudera.com
>> >> > >wrote:
>> >> > >>
>> >> > >>> It just hit me why this problem is showing up on YARN and not on
>> >> > >>> standalone.
>> >> > >>>
>> >> > >>> The relevant difference between YARN and standalone is that, on
>> YARN,
>> >> > the
>> >> > >>> app jar is loaded by the system classloader instead of Spark's
>> custom
>> >> > URL
>> >> > >>> classloader.
>> >> > >>>
>> >> > >>> On YARN, the system classloader knows about [the classes in the
>> spark
>> >> > >>> jars,
>> >> > >>> the classes in the primary app jar].   The custom classloader
>> knows
>> >> > about
>> >> > >>> [the classes in secondary app jars] and has the system
>> classloader as
>> >> > its
>> >> > >>> parent.
>> >> > >>>
>> >> > >>> A few relevant facts (mostly redundant with what Sean pointed
>> out):
>> >> > >>> * Every class has a classloader that loaded it.
>> >> > >>> * When an object of class B is instantiated inside of class A, the
>> >> > >>> classloader used for loading B is the classloader that was used
>> for
>> >> > >>> loading
>> >> > >>> A.
>> >> > >>> * When a classloader fails to load a class, it lets its parent
>> >> > classloader
>> >> > >>> try.  If its parent succeeds, its parent becomes the "classloader
>> >> that
>> >> > >>> loaded it".
>> >> > >>>
>> >> > >>> So suppose class B is in a secondary app jar and class A is in the
>> >> > primary
>> >> > >>> app jar:
>> >> > >>> 1. The custom classloader will try to load class A.
>> >> > >>> 2. It will fail, because it only knows about the secondary jars.
>> >> > >>> 3. It will delegate to its parent, the system classloader.
>> >> > >>> 4. The system classloader will succeed, because it knows about the
>> >> > primary
>> >> > >>> app jar.
>> >> > >>> 5. A's classloader will be the system classloader.
>> >> > >>> 6. A tries to instantiate an instance of class B.
>> >> > >>> 7. B will be loaded with A's classloader, which is the system
>> >> > classloader.
>> >> > >>> 8. Loading B will fail, because A's classloader, which is the
>> system
>> >> > >>> classloader, doesn't know about the secondary app jars.
>> >> > >>>
>> >> > >>> In Spark standalone, A and B are both loaded by the custom
>> >> > classloader, so
>> >> > >>> this issue doesn't come up.
>> >> > >>>
>> >> > >>> -Sandy
>> >> > >>>
>> >> > >>> On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell <
>> pwendell@gmail.com
>> >> >
>> >> > >>> wrote:
>> >> > >>>
>> >> > >>> > Having a user add define a custom class inside of an added jar
>> and
>> >> > >>> > instantiate it directly inside of an executor is definitely
>> >> supported
>> >> > >>> > in Spark and has been for a really long time (several years).
>> This
>> >> is
>> >> > >>> > something we do all the time in Spark.
>> >> > >>> >
>> >> > >>> > DB - I'd hold off on a re-architecting of this until we identify
>> >> > >>> > exactly what is causing the bug you are running into.
>> >> > >>> >
>> >> > >>> > In a nutshell, when the bytecode "new Foo()" is run on the
>> >> executor,
>> >> > >>> > it will ask the driver for the class over HTTP using a custom
>> >> > >>> > classloader. Something in that pipeline is breaking here,
>> possibly
>> >> > >>> > related to the YARN deployment stuff.
>> >> > >>> >
>> >> > >>> >
>> >> > >>> > On Mon, May 19, 2014 at 12:29 AM, Sean Owen <sowen@cloudera.com
>> >
>> >> > wrote:
>> >> > >>> > > I don't think a customer classloader is necessary.
>> >> > >>> > >
>> >> > >>> > > Well, it occurs to me that this is no new problem. Hadoop,
>> >> Tomcat,
>> >> > etc
>> >> > >>> > > all run custom user code that creates new user objects without
>> >> > >>> > > reflection. I should go see how that's done. Maybe it's
>> totally
>> >> > valid
>> >> > >>> > > to set the thread's context classloader for just this purpose,
>> >> and
>> >> > I
>> >> > >>> > > am not thinking clearly.
>> >> > >>> > >
>> >> > >>> > > On Mon, May 19, 2014 at 8:26 AM, Andrew Ash <
>> >> andrew@andrewash.com>
>> >> > >>> > wrote:
>> >> > >>> > >> Sounds like the problem is that classloaders always look in
>> >> their
>> >> > >>> > parents
>> >> > >>> > >> before themselves, and Spark users want executors to pick up
>> >> > classes
>> >> > >>> > from
>> >> > >>> > >> their custom code before the ones in Spark plus its
>> >> dependencies.
>> >> > >>> > >>
>> >> > >>> > >> Would a custom classloader that delegates to the parent after
>> >> > first
>> >> > >>> > >> checking itself fix this up?
>> >> > >>> > >>
>> >> > >>> > >>
>> >> > >>> > >> On Mon, May 19, 2014 at 12:17 AM, DB Tsai <
>> dbtsai@stanford.edu>
>> >> > >>> wrote:
>> >> > >>> > >>
>> >> > >>> > >>> Hi Sean,
>> >> > >>> > >>>
>> >> > >>> > >>> It's true that the issue here is classloader, and due to the
>> >> > >>> > classloader
>> >> > >>> > >>> delegation model, users have to use reflection in the
>> executors
>> >> > to
>> >> > >>> > pick up
>> >> > >>> > >>> the classloader in order to use those classes added by
>> >> sc.addJars
>> >> > >>> APIs.
>> >> > >>> > >>> However, it's very inconvenience for users, and not
>> documented
>> >> in
>> >> > >>> > spark.
>> >> > >>> > >>>
>> >> > >>> > >>> I'm working on a patch to solve it by calling the protected
>> >> > method
>> >> > >>> > addURL
>> >> > >>> > >>> in URLClassLoader to update the current default
>> classloader, so
>> >> > no
>> >> > >>> > >>> customClassLoader anymore. I wonder if this is an good way
>> to
>> >> go.
>> >> > >>> > >>>
>> >> > >>> > >>>   private def addURL(url: URL, loader: URLClassLoader){
>> >> > >>> > >>>     try {
>> >> > >>> > >>>       val method: Method =
>> >> > >>> > >>> classOf[URLClassLoader].getDeclaredMethod("addURL",
>> >> classOf[URL])
>> >> > >>> > >>>       method.setAccessible(true)
>> >> > >>> > >>>       method.invoke(loader, url)
>> >> > >>> > >>>     }
>> >> > >>> > >>>     catch {
>> >> > >>> > >>>       case t: Throwable => {
>> >> > >>> > >>>         throw new IOException("Error, could not add URL to
>> >> system
>> >> > >>> > >>> classloader")
>> >> > >>> > >>>       }
>> >> > >>> > >>>     }
>> >> > >>> > >>>   }
>> >> > >>> > >>>
>> >> > >>> > >>>
>> >> > >>> > >>>
>> >> > >>> > >>> Sincerely,
>> >> > >>> > >>>
>> >> > >>> > >>> DB Tsai
>> >> > >>> > >>> -------------------------------------------------------
>> >> > >>> > >>> My Blog: https://www.dbtsai.com
>> >> > >>> > >>> LinkedIn: https://www.linkedin.com/in/dbtsai
>> >> > >>> > >>>
>> >> > >>> > >>>
>> >> > >>> > >>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen <
>> >> sowen@cloudera.com>
>> >> > >>> > wrote:
>> >> > >>> > >>>
>> >> > >>> > >>> > I might be stating the obvious for everyone, but the issue
>> >> > here is
>> >> > >>> > not
>> >> > >>> > >>> > reflection or the source of the JAR, but the ClassLoader.
>> The
>> >> > >>> basic
>> >> > >>> > >>> > rules are this.
>> >> > >>> > >>> >
>> >> > >>> > >>> > "new Foo" will use the ClassLoader that defines Foo. This
>> is
>> >> > >>> usually
>> >> > >>> > >>> > the ClassLoader that loaded whatever it is that first
>> >> > referenced
>> >> > >>> Foo
>> >> > >>> > >>> > and caused it to be loaded -- usually the ClassLoader
>> holding
>> >> > your
>> >> > >>> > >>> > other app classes.
>> >> > >>> > >>> >
>> >> > >>> > >>> > ClassLoaders can have a parent-child relationship.
>> >> ClassLoaders
>> >> > >>> > always
>> >> > >>> > >>> > look in their parent before themselves.
>> >> > >>> > >>> >
>> >> > >>> > >>> > (Careful then -- in contexts like Hadoop or Tomcat where
>> your
>> >> > app
>> >> > >>> is
>> >> > >>> > >>> > loaded in a child ClassLoader, and you reference a class
>> that
>> >> > >>> Hadoop
>> >> > >>> > >>> > or Tomcat also has (like a lib class) you will get the
>> >> > container's
>> >> > >>> > >>> > version!)
>> >> > >>> > >>> >
>> >> > >>> > >>> > When you load an external JAR it has a separate
>> ClassLoader
>> >> > which
>> >> > >>> > does
>> >> > >>> > >>> > not necessarily bear any relation to the one containing
>> your
>> >> > app
>> >> > >>> > >>> > classes, so yeah it is not generally going to make "new
>> Foo"
>> >> > work.
>> >> > >>> > >>> >
>> >> > >>> > >>> > Reflection lets you pick the ClassLoader, yes.
>> >> > >>> > >>> >
>> >> > >>> > >>> > I would not call setContextClassLoader.
>> >> > >>> > >>> >
>> >> > >>> > >>> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza <
>> >> > >>> > sandy.ryza@cloudera.com>
>> >> > >>> > >>> > wrote:
>> >> > >>> > >>> > > I spoke with DB offline about this a little while ago
>> and
>> >> he
>> >> > >>> > confirmed
>> >> > >>> > >>> > that
>> >> > >>> > >>> > > he was able to access the jar from the driver.
>> >> > >>> > >>> > >
>> >> > >>> > >>> > > The issue appears to be a general Java issue: you can't
>> >> > directly
>> >> > >>> > >>> > > instantiate a class from a dynamically loaded jar.
>> >> > >>> > >>> > >
>> >> > >>> > >>> > > I reproduced it locally outside of Spark with:
>> >> > >>> > >>> > > ---
>> >> > >>> > >>> > >     URLClassLoader urlClassLoader = new
>> URLClassLoader(new
>> >> > >>> URL[] {
>> >> > >>> > new
>> >> > >>> > >>> > > File("myotherjar.jar").toURI().toURL() }, null);
>> >> > >>> > >>> > >
>> >> > >>> Thread.currentThread().setContextClassLoader(urlClassLoader);
>> >> > >>> > >>> > >     MyClassFromMyOtherJar obj = new
>> >> MyClassFromMyOtherJar();
>> >> > >>> > >>> > > ---
>> >> > >>> > >>> > >
>> >> > >>> > >>> > > I was able to load the class with reflection.
>> >> > >>> > >>> >
>> >> > >>> > >>>
>> >> > >>> >
>> >> > >>>
>> >> > >>
>> >> > >>
>> >> >
>> >>
>>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by Sandy Ryza <sa...@cloudera.com>.
Is that an assumption we can make?  I think we'd run into an issue in this
situation:

*In primary jar:*
def makeDynamicObject(clazz: String) = Class.forName(clazz).newInstance()

*In app code:*
sc.addJar("dynamicjar.jar")
...
rdd.map(x => makeDynamicObject("some.class.from.DynamicJar"))

It might be fair to say that the user should make sure to use the context
classloader when instantiating dynamic classes, but I think it's weird that
this code would work on Spark standalone but not on YARN.

-Sandy


On Wed, May 21, 2014 at 2:10 PM, Xiangrui Meng <me...@gmail.com> wrote:

> I think adding jars dynamically should work as long as the primary jar
> and the secondary jars do not depend on dynamically added jars, which
> should be the correct logic. -Xiangrui
>
> On Wed, May 21, 2014 at 1:40 PM, DB Tsai <db...@stanford.edu> wrote:
> > This will be another separate story.
> >
> > Since in the yarn deployment, as Sandy said, the app.jar will be always
> in
> > the systemclassloader which means any object instantiated in app.jar will
> > have parent loader of systemclassloader instead of custom one. As a
> result,
> > the custom class loader in yarn will never work without specifically
> using
> > reflection.
> >
> > Solution will be not using system classloader in the classloader
> hierarchy,
> > and add all the resources in system one into custom one. This is the
> > approach of tomcat takes.
> >
> > Or we can directly overwirte the system class loader by calling the
> > protected method `addURL` which will not work and throw exception if the
> > code is wrapped in security manager.
> >
> >
> > Sincerely,
> >
> > DB Tsai
> > -------------------------------------------------------
> > My Blog: https://www.dbtsai.com
> > LinkedIn: https://www.linkedin.com/in/dbtsai
> >
> >
> > On Wed, May 21, 2014 at 1:13 PM, Sandy Ryza <sa...@cloudera.com>
> wrote:
> >
> >> This will solve the issue for jars added upon application submission,
> but,
> >> on top of this, we need to make sure that anything dynamically added
> >> through sc.addJar works as well.
> >>
> >> To do so, we need to make sure that any jars retrieved via the driver's
> >> HTTP server are loaded by the same classloader that loads the jars
> given on
> >> app submission.  To achieve this, we need to either use the same
> >> classloader for both system jars and user jars, or make sure that the
> user
> >> jars given on app submission are under the same classloader used for
> >> dynamically added jars.
> >>
> >> On Tue, May 20, 2014 at 5:59 PM, Xiangrui Meng <me...@gmail.com>
> wrote:
> >>
> >> > Talked with Sandy and DB offline. I think the best solution is sending
> >> > the secondary jars to the distributed cache of all containers rather
> >> > than just the master, and set the classpath to include spark jar,
> >> > primary app jar, and secondary jars before executor starts. In this
> >> > way, user only needs to specify secondary jars via --jars instead of
> >> > calling sc.addJar inside the code. It also solves the scalability
> >> > problem of serving all the jars via http.
> >> >
> >> > If this solution sounds good, I can try to make a patch.
> >> >
> >> > Best,
> >> > Xiangrui
> >> >
> >> > On Mon, May 19, 2014 at 10:04 PM, DB Tsai <db...@stanford.edu>
> wrote:
> >> > > In 1.0, there is a new option for users to choose which classloader
> has
> >> > > higher priority via spark.files.userClassPathFirst, I decided to
> submit
> >> > the
> >> > > PR for 0.9 first. We use this patch in our lab and we can use those
> >> jars
> >> > > added by sc.addJar without reflection.
> >> > >
> >> > > https://github.com/apache/spark/pull/834
> >> > >
> >> > > Can anyone comment if it's a good approach?
> >> > >
> >> > > Thanks.
> >> > >
> >> > >
> >> > > Sincerely,
> >> > >
> >> > > DB Tsai
> >> > > -------------------------------------------------------
> >> > > My Blog: https://www.dbtsai.com
> >> > > LinkedIn: https://www.linkedin.com/in/dbtsai
> >> > >
> >> > >
> >> > > On Mon, May 19, 2014 at 7:42 PM, DB Tsai <db...@stanford.edu>
> wrote:
> >> > >
> >> > >> Good summary! We fixed it in branch 0.9 since our production is
> still
> >> in
> >> > >> 0.9. I'm porting it to 1.0 now, and hopefully will submit PR for
> 1.0
> >> > >> tonight.
> >> > >>
> >> > >>
> >> > >> Sincerely,
> >> > >>
> >> > >> DB Tsai
> >> > >> -------------------------------------------------------
> >> > >> My Blog: https://www.dbtsai.com
> >> > >> LinkedIn: https://www.linkedin.com/in/dbtsai
> >> > >>
> >> > >>
> >> > >> On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza <
> sandy.ryza@cloudera.com
> >> > >wrote:
> >> > >>
> >> > >>> It just hit me why this problem is showing up on YARN and not on
> >> > >>> standalone.
> >> > >>>
> >> > >>> The relevant difference between YARN and standalone is that, on
> YARN,
> >> > the
> >> > >>> app jar is loaded by the system classloader instead of Spark's
> custom
> >> > URL
> >> > >>> classloader.
> >> > >>>
> >> > >>> On YARN, the system classloader knows about [the classes in the
> spark
> >> > >>> jars,
> >> > >>> the classes in the primary app jar].   The custom classloader
> knows
> >> > about
> >> > >>> [the classes in secondary app jars] and has the system
> classloader as
> >> > its
> >> > >>> parent.
> >> > >>>
> >> > >>> A few relevant facts (mostly redundant with what Sean pointed
> out):
> >> > >>> * Every class has a classloader that loaded it.
> >> > >>> * When an object of class B is instantiated inside of class A, the
> >> > >>> classloader used for loading B is the classloader that was used
> for
> >> > >>> loading
> >> > >>> A.
> >> > >>> * When a classloader fails to load a class, it lets its parent
> >> > classloader
> >> > >>> try.  If its parent succeeds, its parent becomes the "classloader
> >> that
> >> > >>> loaded it".
> >> > >>>
> >> > >>> So suppose class B is in a secondary app jar and class A is in the
> >> > primary
> >> > >>> app jar:
> >> > >>> 1. The custom classloader will try to load class A.
> >> > >>> 2. It will fail, because it only knows about the secondary jars.
> >> > >>> 3. It will delegate to its parent, the system classloader.
> >> > >>> 4. The system classloader will succeed, because it knows about the
> >> > primary
> >> > >>> app jar.
> >> > >>> 5. A's classloader will be the system classloader.
> >> > >>> 6. A tries to instantiate an instance of class B.
> >> > >>> 7. B will be loaded with A's classloader, which is the system
> >> > classloader.
> >> > >>> 8. Loading B will fail, because A's classloader, which is the
> system
> >> > >>> classloader, doesn't know about the secondary app jars.
> >> > >>>
> >> > >>> In Spark standalone, A and B are both loaded by the custom
> >> > classloader, so
> >> > >>> this issue doesn't come up.
> >> > >>>
> >> > >>> -Sandy
> >> > >>>
> >> > >>> On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell <
> pwendell@gmail.com
> >> >
> >> > >>> wrote:
> >> > >>>
> >> > >>> > Having a user add define a custom class inside of an added jar
> and
> >> > >>> > instantiate it directly inside of an executor is definitely
> >> supported
> >> > >>> > in Spark and has been for a really long time (several years).
> This
> >> is
> >> > >>> > something we do all the time in Spark.
> >> > >>> >
> >> > >>> > DB - I'd hold off on a re-architecting of this until we identify
> >> > >>> > exactly what is causing the bug you are running into.
> >> > >>> >
> >> > >>> > In a nutshell, when the bytecode "new Foo()" is run on the
> >> executor,
> >> > >>> > it will ask the driver for the class over HTTP using a custom
> >> > >>> > classloader. Something in that pipeline is breaking here,
> possibly
> >> > >>> > related to the YARN deployment stuff.
> >> > >>> >
> >> > >>> >
> >> > >>> > On Mon, May 19, 2014 at 12:29 AM, Sean Owen <sowen@cloudera.com
> >
> >> > wrote:
> >> > >>> > > I don't think a customer classloader is necessary.
> >> > >>> > >
> >> > >>> > > Well, it occurs to me that this is no new problem. Hadoop,
> >> Tomcat,
> >> > etc
> >> > >>> > > all run custom user code that creates new user objects without
> >> > >>> > > reflection. I should go see how that's done. Maybe it's
> totally
> >> > valid
> >> > >>> > > to set the thread's context classloader for just this purpose,
> >> and
> >> > I
> >> > >>> > > am not thinking clearly.
> >> > >>> > >
> >> > >>> > > On Mon, May 19, 2014 at 8:26 AM, Andrew Ash <
> >> andrew@andrewash.com>
> >> > >>> > wrote:
> >> > >>> > >> Sounds like the problem is that classloaders always look in
> >> their
> >> > >>> > parents
> >> > >>> > >> before themselves, and Spark users want executors to pick up
> >> > classes
> >> > >>> > from
> >> > >>> > >> their custom code before the ones in Spark plus its
> >> dependencies.
> >> > >>> > >>
> >> > >>> > >> Would a custom classloader that delegates to the parent after
> >> > first
> >> > >>> > >> checking itself fix this up?
> >> > >>> > >>
> >> > >>> > >>
> >> > >>> > >> On Mon, May 19, 2014 at 12:17 AM, DB Tsai <
> dbtsai@stanford.edu>
> >> > >>> wrote:
> >> > >>> > >>
> >> > >>> > >>> Hi Sean,
> >> > >>> > >>>
> >> > >>> > >>> It's true that the issue here is classloader, and due to the
> >> > >>> > classloader
> >> > >>> > >>> delegation model, users have to use reflection in the
> executors
> >> > to
> >> > >>> > pick up
> >> > >>> > >>> the classloader in order to use those classes added by
> >> sc.addJars
> >> > >>> APIs.
> >> > >>> > >>> However, it's very inconvenience for users, and not
> documented
> >> in
> >> > >>> > spark.
> >> > >>> > >>>
> >> > >>> > >>> I'm working on a patch to solve it by calling the protected
> >> > method
> >> > >>> > addURL
> >> > >>> > >>> in URLClassLoader to update the current default
> classloader, so
> >> > no
> >> > >>> > >>> customClassLoader anymore. I wonder if this is an good way
> to
> >> go.
> >> > >>> > >>>
> >> > >>> > >>>   private def addURL(url: URL, loader: URLClassLoader){
> >> > >>> > >>>     try {
> >> > >>> > >>>       val method: Method =
> >> > >>> > >>> classOf[URLClassLoader].getDeclaredMethod("addURL",
> >> classOf[URL])
> >> > >>> > >>>       method.setAccessible(true)
> >> > >>> > >>>       method.invoke(loader, url)
> >> > >>> > >>>     }
> >> > >>> > >>>     catch {
> >> > >>> > >>>       case t: Throwable => {
> >> > >>> > >>>         throw new IOException("Error, could not add URL to
> >> system
> >> > >>> > >>> classloader")
> >> > >>> > >>>       }
> >> > >>> > >>>     }
> >> > >>> > >>>   }
> >> > >>> > >>>
> >> > >>> > >>>
> >> > >>> > >>>
> >> > >>> > >>> Sincerely,
> >> > >>> > >>>
> >> > >>> > >>> DB Tsai
> >> > >>> > >>> -------------------------------------------------------
> >> > >>> > >>> My Blog: https://www.dbtsai.com
> >> > >>> > >>> LinkedIn: https://www.linkedin.com/in/dbtsai
> >> > >>> > >>>
> >> > >>> > >>>
> >> > >>> > >>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen <
> >> sowen@cloudera.com>
> >> > >>> > wrote:
> >> > >>> > >>>
> >> > >>> > >>> > I might be stating the obvious for everyone, but the issue
> >> > here is
> >> > >>> > not
> >> > >>> > >>> > reflection or the source of the JAR, but the ClassLoader.
> The
> >> > >>> basic
> >> > >>> > >>> > rules are this.
> >> > >>> > >>> >
> >> > >>> > >>> > "new Foo" will use the ClassLoader that defines Foo. This
> is
> >> > >>> usually
> >> > >>> > >>> > the ClassLoader that loaded whatever it is that first
> >> > referenced
> >> > >>> Foo
> >> > >>> > >>> > and caused it to be loaded -- usually the ClassLoader
> holding
> >> > your
> >> > >>> > >>> > other app classes.
> >> > >>> > >>> >
> >> > >>> > >>> > ClassLoaders can have a parent-child relationship.
> >> ClassLoaders
> >> > >>> > always
> >> > >>> > >>> > look in their parent before themselves.
> >> > >>> > >>> >
> >> > >>> > >>> > (Careful then -- in contexts like Hadoop or Tomcat where
> your
> >> > app
> >> > >>> is
> >> > >>> > >>> > loaded in a child ClassLoader, and you reference a class
> that
> >> > >>> Hadoop
> >> > >>> > >>> > or Tomcat also has (like a lib class) you will get the
> >> > container's
> >> > >>> > >>> > version!)
> >> > >>> > >>> >
> >> > >>> > >>> > When you load an external JAR it has a separate
> ClassLoader
> >> > which
> >> > >>> > does
> >> > >>> > >>> > not necessarily bear any relation to the one containing
> your
> >> > app
> >> > >>> > >>> > classes, so yeah it is not generally going to make "new
> Foo"
> >> > work.
> >> > >>> > >>> >
> >> > >>> > >>> > Reflection lets you pick the ClassLoader, yes.
> >> > >>> > >>> >
> >> > >>> > >>> > I would not call setContextClassLoader.
> >> > >>> > >>> >
> >> > >>> > >>> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza <
> >> > >>> > sandy.ryza@cloudera.com>
> >> > >>> > >>> > wrote:
> >> > >>> > >>> > > I spoke with DB offline about this a little while ago
> and
> >> he
> >> > >>> > confirmed
> >> > >>> > >>> > that
> >> > >>> > >>> > > he was able to access the jar from the driver.
> >> > >>> > >>> > >
> >> > >>> > >>> > > The issue appears to be a general Java issue: you can't
> >> > directly
> >> > >>> > >>> > > instantiate a class from a dynamically loaded jar.
> >> > >>> > >>> > >
> >> > >>> > >>> > > I reproduced it locally outside of Spark with:
> >> > >>> > >>> > > ---
> >> > >>> > >>> > >     URLClassLoader urlClassLoader = new
> URLClassLoader(new
> >> > >>> URL[] {
> >> > >>> > new
> >> > >>> > >>> > > File("myotherjar.jar").toURI().toURL() }, null);
> >> > >>> > >>> > >
> >> > >>> Thread.currentThread().setContextClassLoader(urlClassLoader);
> >> > >>> > >>> > >     MyClassFromMyOtherJar obj = new
> >> MyClassFromMyOtherJar();
> >> > >>> > >>> > > ---
> >> > >>> > >>> > >
> >> > >>> > >>> > > I was able to load the class with reflection.
> >> > >>> > >>> >
> >> > >>> > >>>
> >> > >>> >
> >> > >>>
> >> > >>
> >> > >>
> >> >
> >>
>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by DB Tsai <db...@stanford.edu>.
How about the jars added dynamically? Those will be in customLoader's
classpath but not in the system one. Unfortunately, when we reference to
those jars added dynamically in primary jar, the default classloader will
be the system one not the custom one.

It works in standalone mode since the primary jar is not in the system
loader but custom one, so when we reference those jars added dynamically,
we can find it without reflection.


Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Wed, May 21, 2014 at 2:10 PM, Xiangrui Meng <me...@gmail.com> wrote:

> I think adding jars dynamically should work as long as the primary jar
> and the secondary jars do not depend on dynamically added jars, which
> should be the correct logic. -Xiangrui
>
> On Wed, May 21, 2014 at 1:40 PM, DB Tsai <db...@stanford.edu> wrote:
> > This will be another separate story.
> >
> > Since in the yarn deployment, as Sandy said, the app.jar will be always
> in
> > the systemclassloader which means any object instantiated in app.jar will
> > have parent loader of systemclassloader instead of custom one. As a
> result,
> > the custom class loader in yarn will never work without specifically
> using
> > reflection.
> >
> > Solution will be not using system classloader in the classloader
> hierarchy,
> > and add all the resources in system one into custom one. This is the
> > approach of tomcat takes.
> >
> > Or we can directly overwirte the system class loader by calling the
> > protected method `addURL` which will not work and throw exception if the
> > code is wrapped in security manager.
> >
> >
> > Sincerely,
> >
> > DB Tsai
> > -------------------------------------------------------
> > My Blog: https://www.dbtsai.com
> > LinkedIn: https://www.linkedin.com/in/dbtsai
> >
> >
> > On Wed, May 21, 2014 at 1:13 PM, Sandy Ryza <sa...@cloudera.com>
> wrote:
> >
> >> This will solve the issue for jars added upon application submission,
> but,
> >> on top of this, we need to make sure that anything dynamically added
> >> through sc.addJar works as well.
> >>
> >> To do so, we need to make sure that any jars retrieved via the driver's
> >> HTTP server are loaded by the same classloader that loads the jars
> given on
> >> app submission.  To achieve this, we need to either use the same
> >> classloader for both system jars and user jars, or make sure that the
> user
> >> jars given on app submission are under the same classloader used for
> >> dynamically added jars.
> >>
> >> On Tue, May 20, 2014 at 5:59 PM, Xiangrui Meng <me...@gmail.com>
> wrote:
> >>
> >> > Talked with Sandy and DB offline. I think the best solution is sending
> >> > the secondary jars to the distributed cache of all containers rather
> >> > than just the master, and set the classpath to include spark jar,
> >> > primary app jar, and secondary jars before executor starts. In this
> >> > way, user only needs to specify secondary jars via --jars instead of
> >> > calling sc.addJar inside the code. It also solves the scalability
> >> > problem of serving all the jars via http.
> >> >
> >> > If this solution sounds good, I can try to make a patch.
> >> >
> >> > Best,
> >> > Xiangrui
> >> >
> >> > On Mon, May 19, 2014 at 10:04 PM, DB Tsai <db...@stanford.edu>
> wrote:
> >> > > In 1.0, there is a new option for users to choose which classloader
> has
> >> > > higher priority via spark.files.userClassPathFirst, I decided to
> submit
> >> > the
> >> > > PR for 0.9 first. We use this patch in our lab and we can use those
> >> jars
> >> > > added by sc.addJar without reflection.
> >> > >
> >> > > https://github.com/apache/spark/pull/834
> >> > >
> >> > > Can anyone comment if it's a good approach?
> >> > >
> >> > > Thanks.
> >> > >
> >> > >
> >> > > Sincerely,
> >> > >
> >> > > DB Tsai
> >> > > -------------------------------------------------------
> >> > > My Blog: https://www.dbtsai.com
> >> > > LinkedIn: https://www.linkedin.com/in/dbtsai
> >> > >
> >> > >
> >> > > On Mon, May 19, 2014 at 7:42 PM, DB Tsai <db...@stanford.edu>
> wrote:
> >> > >
> >> > >> Good summary! We fixed it in branch 0.9 since our production is
> still
> >> in
> >> > >> 0.9. I'm porting it to 1.0 now, and hopefully will submit PR for
> 1.0
> >> > >> tonight.
> >> > >>
> >> > >>
> >> > >> Sincerely,
> >> > >>
> >> > >> DB Tsai
> >> > >> -------------------------------------------------------
> >> > >> My Blog: https://www.dbtsai.com
> >> > >> LinkedIn: https://www.linkedin.com/in/dbtsai
> >> > >>
> >> > >>
> >> > >> On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza <
> sandy.ryza@cloudera.com
> >> > >wrote:
> >> > >>
> >> > >>> It just hit me why this problem is showing up on YARN and not on
> >> > >>> standalone.
> >> > >>>
> >> > >>> The relevant difference between YARN and standalone is that, on
> YARN,
> >> > the
> >> > >>> app jar is loaded by the system classloader instead of Spark's
> custom
> >> > URL
> >> > >>> classloader.
> >> > >>>
> >> > >>> On YARN, the system classloader knows about [the classes in the
> spark
> >> > >>> jars,
> >> > >>> the classes in the primary app jar].   The custom classloader
> knows
> >> > about
> >> > >>> [the classes in secondary app jars] and has the system
> classloader as
> >> > its
> >> > >>> parent.
> >> > >>>
> >> > >>> A few relevant facts (mostly redundant with what Sean pointed
> out):
> >> > >>> * Every class has a classloader that loaded it.
> >> > >>> * When an object of class B is instantiated inside of class A, the
> >> > >>> classloader used for loading B is the classloader that was used
> for
> >> > >>> loading
> >> > >>> A.
> >> > >>> * When a classloader fails to load a class, it lets its parent
> >> > classloader
> >> > >>> try.  If its parent succeeds, its parent becomes the "classloader
> >> that
> >> > >>> loaded it".
> >> > >>>
> >> > >>> So suppose class B is in a secondary app jar and class A is in the
> >> > primary
> >> > >>> app jar:
> >> > >>> 1. The custom classloader will try to load class A.
> >> > >>> 2. It will fail, because it only knows about the secondary jars.
> >> > >>> 3. It will delegate to its parent, the system classloader.
> >> > >>> 4. The system classloader will succeed, because it knows about the
> >> > primary
> >> > >>> app jar.
> >> > >>> 5. A's classloader will be the system classloader.
> >> > >>> 6. A tries to instantiate an instance of class B.
> >> > >>> 7. B will be loaded with A's classloader, which is the system
> >> > classloader.
> >> > >>> 8. Loading B will fail, because A's classloader, which is the
> system
> >> > >>> classloader, doesn't know about the secondary app jars.
> >> > >>>
> >> > >>> In Spark standalone, A and B are both loaded by the custom
> >> > classloader, so
> >> > >>> this issue doesn't come up.
> >> > >>>
> >> > >>> -Sandy
> >> > >>>
> >> > >>> On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell <
> pwendell@gmail.com
> >> >
> >> > >>> wrote:
> >> > >>>
> >> > >>> > Having a user add define a custom class inside of an added jar
> and
> >> > >>> > instantiate it directly inside of an executor is definitely
> >> supported
> >> > >>> > in Spark and has been for a really long time (several years).
> This
> >> is
> >> > >>> > something we do all the time in Spark.
> >> > >>> >
> >> > >>> > DB - I'd hold off on a re-architecting of this until we identify
> >> > >>> > exactly what is causing the bug you are running into.
> >> > >>> >
> >> > >>> > In a nutshell, when the bytecode "new Foo()" is run on the
> >> executor,
> >> > >>> > it will ask the driver for the class over HTTP using a custom
> >> > >>> > classloader. Something in that pipeline is breaking here,
> possibly
> >> > >>> > related to the YARN deployment stuff.
> >> > >>> >
> >> > >>> >
> >> > >>> > On Mon, May 19, 2014 at 12:29 AM, Sean Owen <sowen@cloudera.com
> >
> >> > wrote:
> >> > >>> > > I don't think a customer classloader is necessary.
> >> > >>> > >
> >> > >>> > > Well, it occurs to me that this is no new problem. Hadoop,
> >> Tomcat,
> >> > etc
> >> > >>> > > all run custom user code that creates new user objects without
> >> > >>> > > reflection. I should go see how that's done. Maybe it's
> totally
> >> > valid
> >> > >>> > > to set the thread's context classloader for just this purpose,
> >> and
> >> > I
> >> > >>> > > am not thinking clearly.
> >> > >>> > >
> >> > >>> > > On Mon, May 19, 2014 at 8:26 AM, Andrew Ash <
> >> andrew@andrewash.com>
> >> > >>> > wrote:
> >> > >>> > >> Sounds like the problem is that classloaders always look in
> >> their
> >> > >>> > parents
> >> > >>> > >> before themselves, and Spark users want executors to pick up
> >> > classes
> >> > >>> > from
> >> > >>> > >> their custom code before the ones in Spark plus its
> >> dependencies.
> >> > >>> > >>
> >> > >>> > >> Would a custom classloader that delegates to the parent after
> >> > first
> >> > >>> > >> checking itself fix this up?
> >> > >>> > >>
> >> > >>> > >>
> >> > >>> > >> On Mon, May 19, 2014 at 12:17 AM, DB Tsai <
> dbtsai@stanford.edu>
> >> > >>> wrote:
> >> > >>> > >>
> >> > >>> > >>> Hi Sean,
> >> > >>> > >>>
> >> > >>> > >>> It's true that the issue here is classloader, and due to the
> >> > >>> > classloader
> >> > >>> > >>> delegation model, users have to use reflection in the
> executors
> >> > to
> >> > >>> > pick up
> >> > >>> > >>> the classloader in order to use those classes added by
> >> sc.addJars
> >> > >>> APIs.
> >> > >>> > >>> However, it's very inconvenience for users, and not
> documented
> >> in
> >> > >>> > spark.
> >> > >>> > >>>
> >> > >>> > >>> I'm working on a patch to solve it by calling the protected
> >> > method
> >> > >>> > addURL
> >> > >>> > >>> in URLClassLoader to update the current default
> classloader, so
> >> > no
> >> > >>> > >>> customClassLoader anymore. I wonder if this is an good way
> to
> >> go.
> >> > >>> > >>>
> >> > >>> > >>>   private def addURL(url: URL, loader: URLClassLoader){
> >> > >>> > >>>     try {
> >> > >>> > >>>       val method: Method =
> >> > >>> > >>> classOf[URLClassLoader].getDeclaredMethod("addURL",
> >> classOf[URL])
> >> > >>> > >>>       method.setAccessible(true)
> >> > >>> > >>>       method.invoke(loader, url)
> >> > >>> > >>>     }
> >> > >>> > >>>     catch {
> >> > >>> > >>>       case t: Throwable => {
> >> > >>> > >>>         throw new IOException("Error, could not add URL to
> >> system
> >> > >>> > >>> classloader")
> >> > >>> > >>>       }
> >> > >>> > >>>     }
> >> > >>> > >>>   }
> >> > >>> > >>>
> >> > >>> > >>>
> >> > >>> > >>>
> >> > >>> > >>> Sincerely,
> >> > >>> > >>>
> >> > >>> > >>> DB Tsai
> >> > >>> > >>> -------------------------------------------------------
> >> > >>> > >>> My Blog: https://www.dbtsai.com
> >> > >>> > >>> LinkedIn: https://www.linkedin.com/in/dbtsai
> >> > >>> > >>>
> >> > >>> > >>>
> >> > >>> > >>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen <
> >> sowen@cloudera.com>
> >> > >>> > wrote:
> >> > >>> > >>>
> >> > >>> > >>> > I might be stating the obvious for everyone, but the issue
> >> > here is
> >> > >>> > not
> >> > >>> > >>> > reflection or the source of the JAR, but the ClassLoader.
> The
> >> > >>> basic
> >> > >>> > >>> > rules are this.
> >> > >>> > >>> >
> >> > >>> > >>> > "new Foo" will use the ClassLoader that defines Foo. This
> is
> >> > >>> usually
> >> > >>> > >>> > the ClassLoader that loaded whatever it is that first
> >> > referenced
> >> > >>> Foo
> >> > >>> > >>> > and caused it to be loaded -- usually the ClassLoader
> holding
> >> > your
> >> > >>> > >>> > other app classes.
> >> > >>> > >>> >
> >> > >>> > >>> > ClassLoaders can have a parent-child relationship.
> >> ClassLoaders
> >> > >>> > always
> >> > >>> > >>> > look in their parent before themselves.
> >> > >>> > >>> >
> >> > >>> > >>> > (Careful then -- in contexts like Hadoop or Tomcat where
> your
> >> > app
> >> > >>> is
> >> > >>> > >>> > loaded in a child ClassLoader, and you reference a class
> that
> >> > >>> Hadoop
> >> > >>> > >>> > or Tomcat also has (like a lib class) you will get the
> >> > container's
> >> > >>> > >>> > version!)
> >> > >>> > >>> >
> >> > >>> > >>> > When you load an external JAR it has a separate
> ClassLoader
> >> > which
> >> > >>> > does
> >> > >>> > >>> > not necessarily bear any relation to the one containing
> your
> >> > app
> >> > >>> > >>> > classes, so yeah it is not generally going to make "new
> Foo"
> >> > work.
> >> > >>> > >>> >
> >> > >>> > >>> > Reflection lets you pick the ClassLoader, yes.
> >> > >>> > >>> >
> >> > >>> > >>> > I would not call setContextClassLoader.
> >> > >>> > >>> >
> >> > >>> > >>> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza <
> >> > >>> > sandy.ryza@cloudera.com>
> >> > >>> > >>> > wrote:
> >> > >>> > >>> > > I spoke with DB offline about this a little while ago
> and
> >> he
> >> > >>> > confirmed
> >> > >>> > >>> > that
> >> > >>> > >>> > > he was able to access the jar from the driver.
> >> > >>> > >>> > >
> >> > >>> > >>> > > The issue appears to be a general Java issue: you can't
> >> > directly
> >> > >>> > >>> > > instantiate a class from a dynamically loaded jar.
> >> > >>> > >>> > >
> >> > >>> > >>> > > I reproduced it locally outside of Spark with:
> >> > >>> > >>> > > ---
> >> > >>> > >>> > >     URLClassLoader urlClassLoader = new
> URLClassLoader(new
> >> > >>> URL[] {
> >> > >>> > new
> >> > >>> > >>> > > File("myotherjar.jar").toURI().toURL() }, null);
> >> > >>> > >>> > >
> >> > >>> Thread.currentThread().setContextClassLoader(urlClassLoader);
> >> > >>> > >>> > >     MyClassFromMyOtherJar obj = new
> >> MyClassFromMyOtherJar();
> >> > >>> > >>> > > ---
> >> > >>> > >>> > >
> >> > >>> > >>> > > I was able to load the class with reflection.
> >> > >>> > >>> >
> >> > >>> > >>>
> >> > >>> >
> >> > >>>
> >> > >>
> >> > >>
> >> >
> >>
>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by Xiangrui Meng <me...@gmail.com>.
I think adding jars dynamically should work as long as the primary jar
and the secondary jars do not depend on dynamically added jars, which
should be the correct logic. -Xiangrui

On Wed, May 21, 2014 at 1:40 PM, DB Tsai <db...@stanford.edu> wrote:
> This will be another separate story.
>
> Since in the yarn deployment, as Sandy said, the app.jar will be always in
> the systemclassloader which means any object instantiated in app.jar will
> have parent loader of systemclassloader instead of custom one. As a result,
> the custom class loader in yarn will never work without specifically using
> reflection.
>
> Solution will be not using system classloader in the classloader hierarchy,
> and add all the resources in system one into custom one. This is the
> approach of tomcat takes.
>
> Or we can directly overwirte the system class loader by calling the
> protected method `addURL` which will not work and throw exception if the
> code is wrapped in security manager.
>
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Wed, May 21, 2014 at 1:13 PM, Sandy Ryza <sa...@cloudera.com> wrote:
>
>> This will solve the issue for jars added upon application submission, but,
>> on top of this, we need to make sure that anything dynamically added
>> through sc.addJar works as well.
>>
>> To do so, we need to make sure that any jars retrieved via the driver's
>> HTTP server are loaded by the same classloader that loads the jars given on
>> app submission.  To achieve this, we need to either use the same
>> classloader for both system jars and user jars, or make sure that the user
>> jars given on app submission are under the same classloader used for
>> dynamically added jars.
>>
>> On Tue, May 20, 2014 at 5:59 PM, Xiangrui Meng <me...@gmail.com> wrote:
>>
>> > Talked with Sandy and DB offline. I think the best solution is sending
>> > the secondary jars to the distributed cache of all containers rather
>> > than just the master, and set the classpath to include spark jar,
>> > primary app jar, and secondary jars before executor starts. In this
>> > way, user only needs to specify secondary jars via --jars instead of
>> > calling sc.addJar inside the code. It also solves the scalability
>> > problem of serving all the jars via http.
>> >
>> > If this solution sounds good, I can try to make a patch.
>> >
>> > Best,
>> > Xiangrui
>> >
>> > On Mon, May 19, 2014 at 10:04 PM, DB Tsai <db...@stanford.edu> wrote:
>> > > In 1.0, there is a new option for users to choose which classloader has
>> > > higher priority via spark.files.userClassPathFirst, I decided to submit
>> > the
>> > > PR for 0.9 first. We use this patch in our lab and we can use those
>> jars
>> > > added by sc.addJar without reflection.
>> > >
>> > > https://github.com/apache/spark/pull/834
>> > >
>> > > Can anyone comment if it's a good approach?
>> > >
>> > > Thanks.
>> > >
>> > >
>> > > Sincerely,
>> > >
>> > > DB Tsai
>> > > -------------------------------------------------------
>> > > My Blog: https://www.dbtsai.com
>> > > LinkedIn: https://www.linkedin.com/in/dbtsai
>> > >
>> > >
>> > > On Mon, May 19, 2014 at 7:42 PM, DB Tsai <db...@stanford.edu> wrote:
>> > >
>> > >> Good summary! We fixed it in branch 0.9 since our production is still
>> in
>> > >> 0.9. I'm porting it to 1.0 now, and hopefully will submit PR for 1.0
>> > >> tonight.
>> > >>
>> > >>
>> > >> Sincerely,
>> > >>
>> > >> DB Tsai
>> > >> -------------------------------------------------------
>> > >> My Blog: https://www.dbtsai.com
>> > >> LinkedIn: https://www.linkedin.com/in/dbtsai
>> > >>
>> > >>
>> > >> On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza <sandy.ryza@cloudera.com
>> > >wrote:
>> > >>
>> > >>> It just hit me why this problem is showing up on YARN and not on
>> > >>> standalone.
>> > >>>
>> > >>> The relevant difference between YARN and standalone is that, on YARN,
>> > the
>> > >>> app jar is loaded by the system classloader instead of Spark's custom
>> > URL
>> > >>> classloader.
>> > >>>
>> > >>> On YARN, the system classloader knows about [the classes in the spark
>> > >>> jars,
>> > >>> the classes in the primary app jar].   The custom classloader knows
>> > about
>> > >>> [the classes in secondary app jars] and has the system classloader as
>> > its
>> > >>> parent.
>> > >>>
>> > >>> A few relevant facts (mostly redundant with what Sean pointed out):
>> > >>> * Every class has a classloader that loaded it.
>> > >>> * When an object of class B is instantiated inside of class A, the
>> > >>> classloader used for loading B is the classloader that was used for
>> > >>> loading
>> > >>> A.
>> > >>> * When a classloader fails to load a class, it lets its parent
>> > classloader
>> > >>> try.  If its parent succeeds, its parent becomes the "classloader
>> that
>> > >>> loaded it".
>> > >>>
>> > >>> So suppose class B is in a secondary app jar and class A is in the
>> > primary
>> > >>> app jar:
>> > >>> 1. The custom classloader will try to load class A.
>> > >>> 2. It will fail, because it only knows about the secondary jars.
>> > >>> 3. It will delegate to its parent, the system classloader.
>> > >>> 4. The system classloader will succeed, because it knows about the
>> > primary
>> > >>> app jar.
>> > >>> 5. A's classloader will be the system classloader.
>> > >>> 6. A tries to instantiate an instance of class B.
>> > >>> 7. B will be loaded with A's classloader, which is the system
>> > classloader.
>> > >>> 8. Loading B will fail, because A's classloader, which is the system
>> > >>> classloader, doesn't know about the secondary app jars.
>> > >>>
>> > >>> In Spark standalone, A and B are both loaded by the custom
>> > classloader, so
>> > >>> this issue doesn't come up.
>> > >>>
>> > >>> -Sandy
>> > >>>
>> > >>> On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell <pwendell@gmail.com
>> >
>> > >>> wrote:
>> > >>>
>> > >>> > Having a user add define a custom class inside of an added jar and
>> > >>> > instantiate it directly inside of an executor is definitely
>> supported
>> > >>> > in Spark and has been for a really long time (several years). This
>> is
>> > >>> > something we do all the time in Spark.
>> > >>> >
>> > >>> > DB - I'd hold off on a re-architecting of this until we identify
>> > >>> > exactly what is causing the bug you are running into.
>> > >>> >
>> > >>> > In a nutshell, when the bytecode "new Foo()" is run on the
>> executor,
>> > >>> > it will ask the driver for the class over HTTP using a custom
>> > >>> > classloader. Something in that pipeline is breaking here, possibly
>> > >>> > related to the YARN deployment stuff.
>> > >>> >
>> > >>> >
>> > >>> > On Mon, May 19, 2014 at 12:29 AM, Sean Owen <so...@cloudera.com>
>> > wrote:
>> > >>> > > I don't think a customer classloader is necessary.
>> > >>> > >
>> > >>> > > Well, it occurs to me that this is no new problem. Hadoop,
>> Tomcat,
>> > etc
>> > >>> > > all run custom user code that creates new user objects without
>> > >>> > > reflection. I should go see how that's done. Maybe it's totally
>> > valid
>> > >>> > > to set the thread's context classloader for just this purpose,
>> and
>> > I
>> > >>> > > am not thinking clearly.
>> > >>> > >
>> > >>> > > On Mon, May 19, 2014 at 8:26 AM, Andrew Ash <
>> andrew@andrewash.com>
>> > >>> > wrote:
>> > >>> > >> Sounds like the problem is that classloaders always look in
>> their
>> > >>> > parents
>> > >>> > >> before themselves, and Spark users want executors to pick up
>> > classes
>> > >>> > from
>> > >>> > >> their custom code before the ones in Spark plus its
>> dependencies.
>> > >>> > >>
>> > >>> > >> Would a custom classloader that delegates to the parent after
>> > first
>> > >>> > >> checking itself fix this up?
>> > >>> > >>
>> > >>> > >>
>> > >>> > >> On Mon, May 19, 2014 at 12:17 AM, DB Tsai <db...@stanford.edu>
>> > >>> wrote:
>> > >>> > >>
>> > >>> > >>> Hi Sean,
>> > >>> > >>>
>> > >>> > >>> It's true that the issue here is classloader, and due to the
>> > >>> > classloader
>> > >>> > >>> delegation model, users have to use reflection in the executors
>> > to
>> > >>> > pick up
>> > >>> > >>> the classloader in order to use those classes added by
>> sc.addJars
>> > >>> APIs.
>> > >>> > >>> However, it's very inconvenience for users, and not documented
>> in
>> > >>> > spark.
>> > >>> > >>>
>> > >>> > >>> I'm working on a patch to solve it by calling the protected
>> > method
>> > >>> > addURL
>> > >>> > >>> in URLClassLoader to update the current default classloader, so
>> > no
>> > >>> > >>> customClassLoader anymore. I wonder if this is an good way to
>> go.
>> > >>> > >>>
>> > >>> > >>>   private def addURL(url: URL, loader: URLClassLoader){
>> > >>> > >>>     try {
>> > >>> > >>>       val method: Method =
>> > >>> > >>> classOf[URLClassLoader].getDeclaredMethod("addURL",
>> classOf[URL])
>> > >>> > >>>       method.setAccessible(true)
>> > >>> > >>>       method.invoke(loader, url)
>> > >>> > >>>     }
>> > >>> > >>>     catch {
>> > >>> > >>>       case t: Throwable => {
>> > >>> > >>>         throw new IOException("Error, could not add URL to
>> system
>> > >>> > >>> classloader")
>> > >>> > >>>       }
>> > >>> > >>>     }
>> > >>> > >>>   }
>> > >>> > >>>
>> > >>> > >>>
>> > >>> > >>>
>> > >>> > >>> Sincerely,
>> > >>> > >>>
>> > >>> > >>> DB Tsai
>> > >>> > >>> -------------------------------------------------------
>> > >>> > >>> My Blog: https://www.dbtsai.com
>> > >>> > >>> LinkedIn: https://www.linkedin.com/in/dbtsai
>> > >>> > >>>
>> > >>> > >>>
>> > >>> > >>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen <
>> sowen@cloudera.com>
>> > >>> > wrote:
>> > >>> > >>>
>> > >>> > >>> > I might be stating the obvious for everyone, but the issue
>> > here is
>> > >>> > not
>> > >>> > >>> > reflection or the source of the JAR, but the ClassLoader. The
>> > >>> basic
>> > >>> > >>> > rules are this.
>> > >>> > >>> >
>> > >>> > >>> > "new Foo" will use the ClassLoader that defines Foo. This is
>> > >>> usually
>> > >>> > >>> > the ClassLoader that loaded whatever it is that first
>> > referenced
>> > >>> Foo
>> > >>> > >>> > and caused it to be loaded -- usually the ClassLoader holding
>> > your
>> > >>> > >>> > other app classes.
>> > >>> > >>> >
>> > >>> > >>> > ClassLoaders can have a parent-child relationship.
>> ClassLoaders
>> > >>> > always
>> > >>> > >>> > look in their parent before themselves.
>> > >>> > >>> >
>> > >>> > >>> > (Careful then -- in contexts like Hadoop or Tomcat where your
>> > app
>> > >>> is
>> > >>> > >>> > loaded in a child ClassLoader, and you reference a class that
>> > >>> Hadoop
>> > >>> > >>> > or Tomcat also has (like a lib class) you will get the
>> > container's
>> > >>> > >>> > version!)
>> > >>> > >>> >
>> > >>> > >>> > When you load an external JAR it has a separate ClassLoader
>> > which
>> > >>> > does
>> > >>> > >>> > not necessarily bear any relation to the one containing your
>> > app
>> > >>> > >>> > classes, so yeah it is not generally going to make "new Foo"
>> > work.
>> > >>> > >>> >
>> > >>> > >>> > Reflection lets you pick the ClassLoader, yes.
>> > >>> > >>> >
>> > >>> > >>> > I would not call setContextClassLoader.
>> > >>> > >>> >
>> > >>> > >>> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza <
>> > >>> > sandy.ryza@cloudera.com>
>> > >>> > >>> > wrote:
>> > >>> > >>> > > I spoke with DB offline about this a little while ago and
>> he
>> > >>> > confirmed
>> > >>> > >>> > that
>> > >>> > >>> > > he was able to access the jar from the driver.
>> > >>> > >>> > >
>> > >>> > >>> > > The issue appears to be a general Java issue: you can't
>> > directly
>> > >>> > >>> > > instantiate a class from a dynamically loaded jar.
>> > >>> > >>> > >
>> > >>> > >>> > > I reproduced it locally outside of Spark with:
>> > >>> > >>> > > ---
>> > >>> > >>> > >     URLClassLoader urlClassLoader = new URLClassLoader(new
>> > >>> URL[] {
>> > >>> > new
>> > >>> > >>> > > File("myotherjar.jar").toURI().toURL() }, null);
>> > >>> > >>> > >
>> > >>> Thread.currentThread().setContextClassLoader(urlClassLoader);
>> > >>> > >>> > >     MyClassFromMyOtherJar obj = new
>> MyClassFromMyOtherJar();
>> > >>> > >>> > > ---
>> > >>> > >>> > >
>> > >>> > >>> > > I was able to load the class with reflection.
>> > >>> > >>> >
>> > >>> > >>>
>> > >>> >
>> > >>>
>> > >>
>> > >>
>> >
>>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by DB Tsai <db...@stanford.edu>.
This will be another separate story.

Since in the yarn deployment, as Sandy said, the app.jar will be always in
the systemclassloader which means any object instantiated in app.jar will
have parent loader of systemclassloader instead of custom one. As a result,
the custom class loader in yarn will never work without specifically using
reflection.

Solution will be not using system classloader in the classloader hierarchy,
and add all the resources in system one into custom one. This is the
approach of tomcat takes.

Or we can directly overwirte the system class loader by calling the
protected method `addURL` which will not work and throw exception if the
code is wrapped in security manager.


Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Wed, May 21, 2014 at 1:13 PM, Sandy Ryza <sa...@cloudera.com> wrote:

> This will solve the issue for jars added upon application submission, but,
> on top of this, we need to make sure that anything dynamically added
> through sc.addJar works as well.
>
> To do so, we need to make sure that any jars retrieved via the driver's
> HTTP server are loaded by the same classloader that loads the jars given on
> app submission.  To achieve this, we need to either use the same
> classloader for both system jars and user jars, or make sure that the user
> jars given on app submission are under the same classloader used for
> dynamically added jars.
>
> On Tue, May 20, 2014 at 5:59 PM, Xiangrui Meng <me...@gmail.com> wrote:
>
> > Talked with Sandy and DB offline. I think the best solution is sending
> > the secondary jars to the distributed cache of all containers rather
> > than just the master, and set the classpath to include spark jar,
> > primary app jar, and secondary jars before executor starts. In this
> > way, user only needs to specify secondary jars via --jars instead of
> > calling sc.addJar inside the code. It also solves the scalability
> > problem of serving all the jars via http.
> >
> > If this solution sounds good, I can try to make a patch.
> >
> > Best,
> > Xiangrui
> >
> > On Mon, May 19, 2014 at 10:04 PM, DB Tsai <db...@stanford.edu> wrote:
> > > In 1.0, there is a new option for users to choose which classloader has
> > > higher priority via spark.files.userClassPathFirst, I decided to submit
> > the
> > > PR for 0.9 first. We use this patch in our lab and we can use those
> jars
> > > added by sc.addJar without reflection.
> > >
> > > https://github.com/apache/spark/pull/834
> > >
> > > Can anyone comment if it's a good approach?
> > >
> > > Thanks.
> > >
> > >
> > > Sincerely,
> > >
> > > DB Tsai
> > > -------------------------------------------------------
> > > My Blog: https://www.dbtsai.com
> > > LinkedIn: https://www.linkedin.com/in/dbtsai
> > >
> > >
> > > On Mon, May 19, 2014 at 7:42 PM, DB Tsai <db...@stanford.edu> wrote:
> > >
> > >> Good summary! We fixed it in branch 0.9 since our production is still
> in
> > >> 0.9. I'm porting it to 1.0 now, and hopefully will submit PR for 1.0
> > >> tonight.
> > >>
> > >>
> > >> Sincerely,
> > >>
> > >> DB Tsai
> > >> -------------------------------------------------------
> > >> My Blog: https://www.dbtsai.com
> > >> LinkedIn: https://www.linkedin.com/in/dbtsai
> > >>
> > >>
> > >> On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza <sandy.ryza@cloudera.com
> > >wrote:
> > >>
> > >>> It just hit me why this problem is showing up on YARN and not on
> > >>> standalone.
> > >>>
> > >>> The relevant difference between YARN and standalone is that, on YARN,
> > the
> > >>> app jar is loaded by the system classloader instead of Spark's custom
> > URL
> > >>> classloader.
> > >>>
> > >>> On YARN, the system classloader knows about [the classes in the spark
> > >>> jars,
> > >>> the classes in the primary app jar].   The custom classloader knows
> > about
> > >>> [the classes in secondary app jars] and has the system classloader as
> > its
> > >>> parent.
> > >>>
> > >>> A few relevant facts (mostly redundant with what Sean pointed out):
> > >>> * Every class has a classloader that loaded it.
> > >>> * When an object of class B is instantiated inside of class A, the
> > >>> classloader used for loading B is the classloader that was used for
> > >>> loading
> > >>> A.
> > >>> * When a classloader fails to load a class, it lets its parent
> > classloader
> > >>> try.  If its parent succeeds, its parent becomes the "classloader
> that
> > >>> loaded it".
> > >>>
> > >>> So suppose class B is in a secondary app jar and class A is in the
> > primary
> > >>> app jar:
> > >>> 1. The custom classloader will try to load class A.
> > >>> 2. It will fail, because it only knows about the secondary jars.
> > >>> 3. It will delegate to its parent, the system classloader.
> > >>> 4. The system classloader will succeed, because it knows about the
> > primary
> > >>> app jar.
> > >>> 5. A's classloader will be the system classloader.
> > >>> 6. A tries to instantiate an instance of class B.
> > >>> 7. B will be loaded with A's classloader, which is the system
> > classloader.
> > >>> 8. Loading B will fail, because A's classloader, which is the system
> > >>> classloader, doesn't know about the secondary app jars.
> > >>>
> > >>> In Spark standalone, A and B are both loaded by the custom
> > classloader, so
> > >>> this issue doesn't come up.
> > >>>
> > >>> -Sandy
> > >>>
> > >>> On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell <pwendell@gmail.com
> >
> > >>> wrote:
> > >>>
> > >>> > Having a user add define a custom class inside of an added jar and
> > >>> > instantiate it directly inside of an executor is definitely
> supported
> > >>> > in Spark and has been for a really long time (several years). This
> is
> > >>> > something we do all the time in Spark.
> > >>> >
> > >>> > DB - I'd hold off on a re-architecting of this until we identify
> > >>> > exactly what is causing the bug you are running into.
> > >>> >
> > >>> > In a nutshell, when the bytecode "new Foo()" is run on the
> executor,
> > >>> > it will ask the driver for the class over HTTP using a custom
> > >>> > classloader. Something in that pipeline is breaking here, possibly
> > >>> > related to the YARN deployment stuff.
> > >>> >
> > >>> >
> > >>> > On Mon, May 19, 2014 at 12:29 AM, Sean Owen <so...@cloudera.com>
> > wrote:
> > >>> > > I don't think a customer classloader is necessary.
> > >>> > >
> > >>> > > Well, it occurs to me that this is no new problem. Hadoop,
> Tomcat,
> > etc
> > >>> > > all run custom user code that creates new user objects without
> > >>> > > reflection. I should go see how that's done. Maybe it's totally
> > valid
> > >>> > > to set the thread's context classloader for just this purpose,
> and
> > I
> > >>> > > am not thinking clearly.
> > >>> > >
> > >>> > > On Mon, May 19, 2014 at 8:26 AM, Andrew Ash <
> andrew@andrewash.com>
> > >>> > wrote:
> > >>> > >> Sounds like the problem is that classloaders always look in
> their
> > >>> > parents
> > >>> > >> before themselves, and Spark users want executors to pick up
> > classes
> > >>> > from
> > >>> > >> their custom code before the ones in Spark plus its
> dependencies.
> > >>> > >>
> > >>> > >> Would a custom classloader that delegates to the parent after
> > first
> > >>> > >> checking itself fix this up?
> > >>> > >>
> > >>> > >>
> > >>> > >> On Mon, May 19, 2014 at 12:17 AM, DB Tsai <db...@stanford.edu>
> > >>> wrote:
> > >>> > >>
> > >>> > >>> Hi Sean,
> > >>> > >>>
> > >>> > >>> It's true that the issue here is classloader, and due to the
> > >>> > classloader
> > >>> > >>> delegation model, users have to use reflection in the executors
> > to
> > >>> > pick up
> > >>> > >>> the classloader in order to use those classes added by
> sc.addJars
> > >>> APIs.
> > >>> > >>> However, it's very inconvenience for users, and not documented
> in
> > >>> > spark.
> > >>> > >>>
> > >>> > >>> I'm working on a patch to solve it by calling the protected
> > method
> > >>> > addURL
> > >>> > >>> in URLClassLoader to update the current default classloader, so
> > no
> > >>> > >>> customClassLoader anymore. I wonder if this is an good way to
> go.
> > >>> > >>>
> > >>> > >>>   private def addURL(url: URL, loader: URLClassLoader){
> > >>> > >>>     try {
> > >>> > >>>       val method: Method =
> > >>> > >>> classOf[URLClassLoader].getDeclaredMethod("addURL",
> classOf[URL])
> > >>> > >>>       method.setAccessible(true)
> > >>> > >>>       method.invoke(loader, url)
> > >>> > >>>     }
> > >>> > >>>     catch {
> > >>> > >>>       case t: Throwable => {
> > >>> > >>>         throw new IOException("Error, could not add URL to
> system
> > >>> > >>> classloader")
> > >>> > >>>       }
> > >>> > >>>     }
> > >>> > >>>   }
> > >>> > >>>
> > >>> > >>>
> > >>> > >>>
> > >>> > >>> Sincerely,
> > >>> > >>>
> > >>> > >>> DB Tsai
> > >>> > >>> -------------------------------------------------------
> > >>> > >>> My Blog: https://www.dbtsai.com
> > >>> > >>> LinkedIn: https://www.linkedin.com/in/dbtsai
> > >>> > >>>
> > >>> > >>>
> > >>> > >>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen <
> sowen@cloudera.com>
> > >>> > wrote:
> > >>> > >>>
> > >>> > >>> > I might be stating the obvious for everyone, but the issue
> > here is
> > >>> > not
> > >>> > >>> > reflection or the source of the JAR, but the ClassLoader. The
> > >>> basic
> > >>> > >>> > rules are this.
> > >>> > >>> >
> > >>> > >>> > "new Foo" will use the ClassLoader that defines Foo. This is
> > >>> usually
> > >>> > >>> > the ClassLoader that loaded whatever it is that first
> > referenced
> > >>> Foo
> > >>> > >>> > and caused it to be loaded -- usually the ClassLoader holding
> > your
> > >>> > >>> > other app classes.
> > >>> > >>> >
> > >>> > >>> > ClassLoaders can have a parent-child relationship.
> ClassLoaders
> > >>> > always
> > >>> > >>> > look in their parent before themselves.
> > >>> > >>> >
> > >>> > >>> > (Careful then -- in contexts like Hadoop or Tomcat where your
> > app
> > >>> is
> > >>> > >>> > loaded in a child ClassLoader, and you reference a class that
> > >>> Hadoop
> > >>> > >>> > or Tomcat also has (like a lib class) you will get the
> > container's
> > >>> > >>> > version!)
> > >>> > >>> >
> > >>> > >>> > When you load an external JAR it has a separate ClassLoader
> > which
> > >>> > does
> > >>> > >>> > not necessarily bear any relation to the one containing your
> > app
> > >>> > >>> > classes, so yeah it is not generally going to make "new Foo"
> > work.
> > >>> > >>> >
> > >>> > >>> > Reflection lets you pick the ClassLoader, yes.
> > >>> > >>> >
> > >>> > >>> > I would not call setContextClassLoader.
> > >>> > >>> >
> > >>> > >>> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza <
> > >>> > sandy.ryza@cloudera.com>
> > >>> > >>> > wrote:
> > >>> > >>> > > I spoke with DB offline about this a little while ago and
> he
> > >>> > confirmed
> > >>> > >>> > that
> > >>> > >>> > > he was able to access the jar from the driver.
> > >>> > >>> > >
> > >>> > >>> > > The issue appears to be a general Java issue: you can't
> > directly
> > >>> > >>> > > instantiate a class from a dynamically loaded jar.
> > >>> > >>> > >
> > >>> > >>> > > I reproduced it locally outside of Spark with:
> > >>> > >>> > > ---
> > >>> > >>> > >     URLClassLoader urlClassLoader = new URLClassLoader(new
> > >>> URL[] {
> > >>> > new
> > >>> > >>> > > File("myotherjar.jar").toURI().toURL() }, null);
> > >>> > >>> > >
> > >>> Thread.currentThread().setContextClassLoader(urlClassLoader);
> > >>> > >>> > >     MyClassFromMyOtherJar obj = new
> MyClassFromMyOtherJar();
> > >>> > >>> > > ---
> > >>> > >>> > >
> > >>> > >>> > > I was able to load the class with reflection.
> > >>> > >>> >
> > >>> > >>>
> > >>> >
> > >>>
> > >>
> > >>
> >
>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by Sandy Ryza <sa...@cloudera.com>.
This will solve the issue for jars added upon application submission, but,
on top of this, we need to make sure that anything dynamically added
through sc.addJar works as well.

To do so, we need to make sure that any jars retrieved via the driver's
HTTP server are loaded by the same classloader that loads the jars given on
app submission.  To achieve this, we need to either use the same
classloader for both system jars and user jars, or make sure that the user
jars given on app submission are under the same classloader used for
dynamically added jars.

On Tue, May 20, 2014 at 5:59 PM, Xiangrui Meng <me...@gmail.com> wrote:

> Talked with Sandy and DB offline. I think the best solution is sending
> the secondary jars to the distributed cache of all containers rather
> than just the master, and set the classpath to include spark jar,
> primary app jar, and secondary jars before executor starts. In this
> way, user only needs to specify secondary jars via --jars instead of
> calling sc.addJar inside the code. It also solves the scalability
> problem of serving all the jars via http.
>
> If this solution sounds good, I can try to make a patch.
>
> Best,
> Xiangrui
>
> On Mon, May 19, 2014 at 10:04 PM, DB Tsai <db...@stanford.edu> wrote:
> > In 1.0, there is a new option for users to choose which classloader has
> > higher priority via spark.files.userClassPathFirst, I decided to submit
> the
> > PR for 0.9 first. We use this patch in our lab and we can use those jars
> > added by sc.addJar without reflection.
> >
> > https://github.com/apache/spark/pull/834
> >
> > Can anyone comment if it's a good approach?
> >
> > Thanks.
> >
> >
> > Sincerely,
> >
> > DB Tsai
> > -------------------------------------------------------
> > My Blog: https://www.dbtsai.com
> > LinkedIn: https://www.linkedin.com/in/dbtsai
> >
> >
> > On Mon, May 19, 2014 at 7:42 PM, DB Tsai <db...@stanford.edu> wrote:
> >
> >> Good summary! We fixed it in branch 0.9 since our production is still in
> >> 0.9. I'm porting it to 1.0 now, and hopefully will submit PR for 1.0
> >> tonight.
> >>
> >>
> >> Sincerely,
> >>
> >> DB Tsai
> >> -------------------------------------------------------
> >> My Blog: https://www.dbtsai.com
> >> LinkedIn: https://www.linkedin.com/in/dbtsai
> >>
> >>
> >> On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza <sandy.ryza@cloudera.com
> >wrote:
> >>
> >>> It just hit me why this problem is showing up on YARN and not on
> >>> standalone.
> >>>
> >>> The relevant difference between YARN and standalone is that, on YARN,
> the
> >>> app jar is loaded by the system classloader instead of Spark's custom
> URL
> >>> classloader.
> >>>
> >>> On YARN, the system classloader knows about [the classes in the spark
> >>> jars,
> >>> the classes in the primary app jar].   The custom classloader knows
> about
> >>> [the classes in secondary app jars] and has the system classloader as
> its
> >>> parent.
> >>>
> >>> A few relevant facts (mostly redundant with what Sean pointed out):
> >>> * Every class has a classloader that loaded it.
> >>> * When an object of class B is instantiated inside of class A, the
> >>> classloader used for loading B is the classloader that was used for
> >>> loading
> >>> A.
> >>> * When a classloader fails to load a class, it lets its parent
> classloader
> >>> try.  If its parent succeeds, its parent becomes the "classloader that
> >>> loaded it".
> >>>
> >>> So suppose class B is in a secondary app jar and class A is in the
> primary
> >>> app jar:
> >>> 1. The custom classloader will try to load class A.
> >>> 2. It will fail, because it only knows about the secondary jars.
> >>> 3. It will delegate to its parent, the system classloader.
> >>> 4. The system classloader will succeed, because it knows about the
> primary
> >>> app jar.
> >>> 5. A's classloader will be the system classloader.
> >>> 6. A tries to instantiate an instance of class B.
> >>> 7. B will be loaded with A's classloader, which is the system
> classloader.
> >>> 8. Loading B will fail, because A's classloader, which is the system
> >>> classloader, doesn't know about the secondary app jars.
> >>>
> >>> In Spark standalone, A and B are both loaded by the custom
> classloader, so
> >>> this issue doesn't come up.
> >>>
> >>> -Sandy
> >>>
> >>> On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell <pw...@gmail.com>
> >>> wrote:
> >>>
> >>> > Having a user add define a custom class inside of an added jar and
> >>> > instantiate it directly inside of an executor is definitely supported
> >>> > in Spark and has been for a really long time (several years). This is
> >>> > something we do all the time in Spark.
> >>> >
> >>> > DB - I'd hold off on a re-architecting of this until we identify
> >>> > exactly what is causing the bug you are running into.
> >>> >
> >>> > In a nutshell, when the bytecode "new Foo()" is run on the executor,
> >>> > it will ask the driver for the class over HTTP using a custom
> >>> > classloader. Something in that pipeline is breaking here, possibly
> >>> > related to the YARN deployment stuff.
> >>> >
> >>> >
> >>> > On Mon, May 19, 2014 at 12:29 AM, Sean Owen <so...@cloudera.com>
> wrote:
> >>> > > I don't think a customer classloader is necessary.
> >>> > >
> >>> > > Well, it occurs to me that this is no new problem. Hadoop, Tomcat,
> etc
> >>> > > all run custom user code that creates new user objects without
> >>> > > reflection. I should go see how that's done. Maybe it's totally
> valid
> >>> > > to set the thread's context classloader for just this purpose, and
> I
> >>> > > am not thinking clearly.
> >>> > >
> >>> > > On Mon, May 19, 2014 at 8:26 AM, Andrew Ash <an...@andrewash.com>
> >>> > wrote:
> >>> > >> Sounds like the problem is that classloaders always look in their
> >>> > parents
> >>> > >> before themselves, and Spark users want executors to pick up
> classes
> >>> > from
> >>> > >> their custom code before the ones in Spark plus its dependencies.
> >>> > >>
> >>> > >> Would a custom classloader that delegates to the parent after
> first
> >>> > >> checking itself fix this up?
> >>> > >>
> >>> > >>
> >>> > >> On Mon, May 19, 2014 at 12:17 AM, DB Tsai <db...@stanford.edu>
> >>> wrote:
> >>> > >>
> >>> > >>> Hi Sean,
> >>> > >>>
> >>> > >>> It's true that the issue here is classloader, and due to the
> >>> > classloader
> >>> > >>> delegation model, users have to use reflection in the executors
> to
> >>> > pick up
> >>> > >>> the classloader in order to use those classes added by sc.addJars
> >>> APIs.
> >>> > >>> However, it's very inconvenience for users, and not documented in
> >>> > spark.
> >>> > >>>
> >>> > >>> I'm working on a patch to solve it by calling the protected
> method
> >>> > addURL
> >>> > >>> in URLClassLoader to update the current default classloader, so
> no
> >>> > >>> customClassLoader anymore. I wonder if this is an good way to go.
> >>> > >>>
> >>> > >>>   private def addURL(url: URL, loader: URLClassLoader){
> >>> > >>>     try {
> >>> > >>>       val method: Method =
> >>> > >>> classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
> >>> > >>>       method.setAccessible(true)
> >>> > >>>       method.invoke(loader, url)
> >>> > >>>     }
> >>> > >>>     catch {
> >>> > >>>       case t: Throwable => {
> >>> > >>>         throw new IOException("Error, could not add URL to system
> >>> > >>> classloader")
> >>> > >>>       }
> >>> > >>>     }
> >>> > >>>   }
> >>> > >>>
> >>> > >>>
> >>> > >>>
> >>> > >>> Sincerely,
> >>> > >>>
> >>> > >>> DB Tsai
> >>> > >>> -------------------------------------------------------
> >>> > >>> My Blog: https://www.dbtsai.com
> >>> > >>> LinkedIn: https://www.linkedin.com/in/dbtsai
> >>> > >>>
> >>> > >>>
> >>> > >>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen <so...@cloudera.com>
> >>> > wrote:
> >>> > >>>
> >>> > >>> > I might be stating the obvious for everyone, but the issue
> here is
> >>> > not
> >>> > >>> > reflection or the source of the JAR, but the ClassLoader. The
> >>> basic
> >>> > >>> > rules are this.
> >>> > >>> >
> >>> > >>> > "new Foo" will use the ClassLoader that defines Foo. This is
> >>> usually
> >>> > >>> > the ClassLoader that loaded whatever it is that first
> referenced
> >>> Foo
> >>> > >>> > and caused it to be loaded -- usually the ClassLoader holding
> your
> >>> > >>> > other app classes.
> >>> > >>> >
> >>> > >>> > ClassLoaders can have a parent-child relationship. ClassLoaders
> >>> > always
> >>> > >>> > look in their parent before themselves.
> >>> > >>> >
> >>> > >>> > (Careful then -- in contexts like Hadoop or Tomcat where your
> app
> >>> is
> >>> > >>> > loaded in a child ClassLoader, and you reference a class that
> >>> Hadoop
> >>> > >>> > or Tomcat also has (like a lib class) you will get the
> container's
> >>> > >>> > version!)
> >>> > >>> >
> >>> > >>> > When you load an external JAR it has a separate ClassLoader
> which
> >>> > does
> >>> > >>> > not necessarily bear any relation to the one containing your
> app
> >>> > >>> > classes, so yeah it is not generally going to make "new Foo"
> work.
> >>> > >>> >
> >>> > >>> > Reflection lets you pick the ClassLoader, yes.
> >>> > >>> >
> >>> > >>> > I would not call setContextClassLoader.
> >>> > >>> >
> >>> > >>> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza <
> >>> > sandy.ryza@cloudera.com>
> >>> > >>> > wrote:
> >>> > >>> > > I spoke with DB offline about this a little while ago and he
> >>> > confirmed
> >>> > >>> > that
> >>> > >>> > > he was able to access the jar from the driver.
> >>> > >>> > >
> >>> > >>> > > The issue appears to be a general Java issue: you can't
> directly
> >>> > >>> > > instantiate a class from a dynamically loaded jar.
> >>> > >>> > >
> >>> > >>> > > I reproduced it locally outside of Spark with:
> >>> > >>> > > ---
> >>> > >>> > >     URLClassLoader urlClassLoader = new URLClassLoader(new
> >>> URL[] {
> >>> > new
> >>> > >>> > > File("myotherjar.jar").toURI().toURL() }, null);
> >>> > >>> > >
> >>> Thread.currentThread().setContextClassLoader(urlClassLoader);
> >>> > >>> > >     MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
> >>> > >>> > > ---
> >>> > >>> > >
> >>> > >>> > > I was able to load the class with reflection.
> >>> > >>> >
> >>> > >>>
> >>> >
> >>>
> >>
> >>
>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by Xiangrui Meng <me...@gmail.com>.
Talked with Sandy and DB offline. I think the best solution is sending
the secondary jars to the distributed cache of all containers rather
than just the master, and set the classpath to include spark jar,
primary app jar, and secondary jars before executor starts. In this
way, user only needs to specify secondary jars via --jars instead of
calling sc.addJar inside the code. It also solves the scalability
problem of serving all the jars via http.

If this solution sounds good, I can try to make a patch.

Best,
Xiangrui

On Mon, May 19, 2014 at 10:04 PM, DB Tsai <db...@stanford.edu> wrote:
> In 1.0, there is a new option for users to choose which classloader has
> higher priority via spark.files.userClassPathFirst, I decided to submit the
> PR for 0.9 first. We use this patch in our lab and we can use those jars
> added by sc.addJar without reflection.
>
> https://github.com/apache/spark/pull/834
>
> Can anyone comment if it's a good approach?
>
> Thanks.
>
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Mon, May 19, 2014 at 7:42 PM, DB Tsai <db...@stanford.edu> wrote:
>
>> Good summary! We fixed it in branch 0.9 since our production is still in
>> 0.9. I'm porting it to 1.0 now, and hopefully will submit PR for 1.0
>> tonight.
>>
>>
>> Sincerely,
>>
>> DB Tsai
>> -------------------------------------------------------
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>
>>
>> On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza <sa...@cloudera.com>wrote:
>>
>>> It just hit me why this problem is showing up on YARN and not on
>>> standalone.
>>>
>>> The relevant difference between YARN and standalone is that, on YARN, the
>>> app jar is loaded by the system classloader instead of Spark's custom URL
>>> classloader.
>>>
>>> On YARN, the system classloader knows about [the classes in the spark
>>> jars,
>>> the classes in the primary app jar].   The custom classloader knows about
>>> [the classes in secondary app jars] and has the system classloader as its
>>> parent.
>>>
>>> A few relevant facts (mostly redundant with what Sean pointed out):
>>> * Every class has a classloader that loaded it.
>>> * When an object of class B is instantiated inside of class A, the
>>> classloader used for loading B is the classloader that was used for
>>> loading
>>> A.
>>> * When a classloader fails to load a class, it lets its parent classloader
>>> try.  If its parent succeeds, its parent becomes the "classloader that
>>> loaded it".
>>>
>>> So suppose class B is in a secondary app jar and class A is in the primary
>>> app jar:
>>> 1. The custom classloader will try to load class A.
>>> 2. It will fail, because it only knows about the secondary jars.
>>> 3. It will delegate to its parent, the system classloader.
>>> 4. The system classloader will succeed, because it knows about the primary
>>> app jar.
>>> 5. A's classloader will be the system classloader.
>>> 6. A tries to instantiate an instance of class B.
>>> 7. B will be loaded with A's classloader, which is the system classloader.
>>> 8. Loading B will fail, because A's classloader, which is the system
>>> classloader, doesn't know about the secondary app jars.
>>>
>>> In Spark standalone, A and B are both loaded by the custom classloader, so
>>> this issue doesn't come up.
>>>
>>> -Sandy
>>>
>>> On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell <pw...@gmail.com>
>>> wrote:
>>>
>>> > Having a user add define a custom class inside of an added jar and
>>> > instantiate it directly inside of an executor is definitely supported
>>> > in Spark and has been for a really long time (several years). This is
>>> > something we do all the time in Spark.
>>> >
>>> > DB - I'd hold off on a re-architecting of this until we identify
>>> > exactly what is causing the bug you are running into.
>>> >
>>> > In a nutshell, when the bytecode "new Foo()" is run on the executor,
>>> > it will ask the driver for the class over HTTP using a custom
>>> > classloader. Something in that pipeline is breaking here, possibly
>>> > related to the YARN deployment stuff.
>>> >
>>> >
>>> > On Mon, May 19, 2014 at 12:29 AM, Sean Owen <so...@cloudera.com> wrote:
>>> > > I don't think a customer classloader is necessary.
>>> > >
>>> > > Well, it occurs to me that this is no new problem. Hadoop, Tomcat, etc
>>> > > all run custom user code that creates new user objects without
>>> > > reflection. I should go see how that's done. Maybe it's totally valid
>>> > > to set the thread's context classloader for just this purpose, and I
>>> > > am not thinking clearly.
>>> > >
>>> > > On Mon, May 19, 2014 at 8:26 AM, Andrew Ash <an...@andrewash.com>
>>> > wrote:
>>> > >> Sounds like the problem is that classloaders always look in their
>>> > parents
>>> > >> before themselves, and Spark users want executors to pick up classes
>>> > from
>>> > >> their custom code before the ones in Spark plus its dependencies.
>>> > >>
>>> > >> Would a custom classloader that delegates to the parent after first
>>> > >> checking itself fix this up?
>>> > >>
>>> > >>
>>> > >> On Mon, May 19, 2014 at 12:17 AM, DB Tsai <db...@stanford.edu>
>>> wrote:
>>> > >>
>>> > >>> Hi Sean,
>>> > >>>
>>> > >>> It's true that the issue here is classloader, and due to the
>>> > classloader
>>> > >>> delegation model, users have to use reflection in the executors to
>>> > pick up
>>> > >>> the classloader in order to use those classes added by sc.addJars
>>> APIs.
>>> > >>> However, it's very inconvenience for users, and not documented in
>>> > spark.
>>> > >>>
>>> > >>> I'm working on a patch to solve it by calling the protected method
>>> > addURL
>>> > >>> in URLClassLoader to update the current default classloader, so no
>>> > >>> customClassLoader anymore. I wonder if this is an good way to go.
>>> > >>>
>>> > >>>   private def addURL(url: URL, loader: URLClassLoader){
>>> > >>>     try {
>>> > >>>       val method: Method =
>>> > >>> classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
>>> > >>>       method.setAccessible(true)
>>> > >>>       method.invoke(loader, url)
>>> > >>>     }
>>> > >>>     catch {
>>> > >>>       case t: Throwable => {
>>> > >>>         throw new IOException("Error, could not add URL to system
>>> > >>> classloader")
>>> > >>>       }
>>> > >>>     }
>>> > >>>   }
>>> > >>>
>>> > >>>
>>> > >>>
>>> > >>> Sincerely,
>>> > >>>
>>> > >>> DB Tsai
>>> > >>> -------------------------------------------------------
>>> > >>> My Blog: https://www.dbtsai.com
>>> > >>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>> > >>>
>>> > >>>
>>> > >>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen <so...@cloudera.com>
>>> > wrote:
>>> > >>>
>>> > >>> > I might be stating the obvious for everyone, but the issue here is
>>> > not
>>> > >>> > reflection or the source of the JAR, but the ClassLoader. The
>>> basic
>>> > >>> > rules are this.
>>> > >>> >
>>> > >>> > "new Foo" will use the ClassLoader that defines Foo. This is
>>> usually
>>> > >>> > the ClassLoader that loaded whatever it is that first referenced
>>> Foo
>>> > >>> > and caused it to be loaded -- usually the ClassLoader holding your
>>> > >>> > other app classes.
>>> > >>> >
>>> > >>> > ClassLoaders can have a parent-child relationship. ClassLoaders
>>> > always
>>> > >>> > look in their parent before themselves.
>>> > >>> >
>>> > >>> > (Careful then -- in contexts like Hadoop or Tomcat where your app
>>> is
>>> > >>> > loaded in a child ClassLoader, and you reference a class that
>>> Hadoop
>>> > >>> > or Tomcat also has (like a lib class) you will get the container's
>>> > >>> > version!)
>>> > >>> >
>>> > >>> > When you load an external JAR it has a separate ClassLoader which
>>> > does
>>> > >>> > not necessarily bear any relation to the one containing your app
>>> > >>> > classes, so yeah it is not generally going to make "new Foo" work.
>>> > >>> >
>>> > >>> > Reflection lets you pick the ClassLoader, yes.
>>> > >>> >
>>> > >>> > I would not call setContextClassLoader.
>>> > >>> >
>>> > >>> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza <
>>> > sandy.ryza@cloudera.com>
>>> > >>> > wrote:
>>> > >>> > > I spoke with DB offline about this a little while ago and he
>>> > confirmed
>>> > >>> > that
>>> > >>> > > he was able to access the jar from the driver.
>>> > >>> > >
>>> > >>> > > The issue appears to be a general Java issue: you can't directly
>>> > >>> > > instantiate a class from a dynamically loaded jar.
>>> > >>> > >
>>> > >>> > > I reproduced it locally outside of Spark with:
>>> > >>> > > ---
>>> > >>> > >     URLClassLoader urlClassLoader = new URLClassLoader(new
>>> URL[] {
>>> > new
>>> > >>> > > File("myotherjar.jar").toURI().toURL() }, null);
>>> > >>> > >
>>> Thread.currentThread().setContextClassLoader(urlClassLoader);
>>> > >>> > >     MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
>>> > >>> > > ---
>>> > >>> > >
>>> > >>> > > I was able to load the class with reflection.
>>> > >>> >
>>> > >>>
>>> >
>>>
>>
>>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by Xiangrui Meng <me...@gmail.com>.
Hi DB,

I found it is a little hard to implement the solution I mentioned:

> Do not send the primary jar and secondary jars to executors'
> distributed cache. Instead, add them to "spark.jars" in SparkSubmit
> and serve them via http by called sc.addJar in SparkContext.

If you look at ApplicationMaster code, which is entry point in
yarn-cluster mode. It actually creates a thread of the user class
first and waits the user class to create a spark context. It means the
user class has to be on the classpath at that time. I think we need to
add the primary jar and secondary jars twice, once to system
classpath, and then to the executor classloader.

Best,
Xiangrui

On Wed, May 21, 2014 at 3:50 PM, DB Tsai <db...@stanford.edu> wrote:
> @Xiangrui
> How about we send the primary jar and secondary jars into distributed cache
> without adding them into the system classloader of executors. Then we add
> them using custom classloader so we don't need to call secondary jars
> through reflection in primary jar. This will be consistent to what we do in
> standalone mode, and also solve the scalability of jar distribution issue.
>
> @Koert
> Yes, that's why I suggest we can either ignore the parent classloader of
> custom class loader to solve this as you say. In this case, we need add the
> all the classpath of the system loader into our custom one (which doesn't
> have parent) so we will not miss the default java classes. This is how
> tomcat works.
>
> @Patrick
> I agree that we should have the fix by Xiangrui first, since it solves most
> of the use case. I don't know when people will use dynamical addJar in Yarn
> since it's most useful for interactive environment.
>
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Wed, May 21, 2014 at 2:57 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> db tsai, i do not think userClassPathFirst is working, unless the classes
>> you load dont reference any classes already loaded by the parent
>> classloader (a mostly hypothetical situation)... i filed a jira for this
>> here:
>> https://issues.apache.org/jira/browse/SPARK-1863
>>
>>
>>
>> On Tue, May 20, 2014 at 1:04 AM, DB Tsai <db...@stanford.edu> wrote:
>>
>> > In 1.0, there is a new option for users to choose which classloader has
>> > higher priority via spark.files.userClassPathFirst, I decided to submit
>> the
>> > PR for 0.9 first. We use this patch in our lab and we can use those jars
>> > added by sc.addJar without reflection.
>> >
>> > https://github.com/apache/spark/pull/834
>> >
>> > Can anyone comment if it's a good approach?
>> >
>> > Thanks.
>> >
>> >
>> > Sincerely,
>> >
>> > DB Tsai
>> > -------------------------------------------------------
>> > My Blog: https://www.dbtsai.com
>> > LinkedIn: https://www.linkedin.com/in/dbtsai
>> >
>> >
>> > On Mon, May 19, 2014 at 7:42 PM, DB Tsai <db...@stanford.edu> wrote:
>> >
>> > > Good summary! We fixed it in branch 0.9 since our production is still
>> in
>> > > 0.9. I'm porting it to 1.0 now, and hopefully will submit PR for 1.0
>> > > tonight.
>> > >
>> > >
>> > > Sincerely,
>> > >
>> > > DB Tsai
>> > > -------------------------------------------------------
>> > > My Blog: https://www.dbtsai.com
>> > > LinkedIn: https://www.linkedin.com/in/dbtsai
>> > >
>> > >
>> > > On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza <sandy.ryza@cloudera.com
>> > >wrote:
>> > >
>> > >> It just hit me why this problem is showing up on YARN and not on
>> > >> standalone.
>> > >>
>> > >> The relevant difference between YARN and standalone is that, on YARN,
>> > the
>> > >> app jar is loaded by the system classloader instead of Spark's custom
>> > URL
>> > >> classloader.
>> > >>
>> > >> On YARN, the system classloader knows about [the classes in the spark
>> > >> jars,
>> > >> the classes in the primary app jar].   The custom classloader knows
>> > about
>> > >> [the classes in secondary app jars] and has the system classloader as
>> > its
>> > >> parent.
>> > >>
>> > >> A few relevant facts (mostly redundant with what Sean pointed out):
>> > >> * Every class has a classloader that loaded it.
>> > >> * When an object of class B is instantiated inside of class A, the
>> > >> classloader used for loading B is the classloader that was used for
>> > >> loading
>> > >> A.
>> > >> * When a classloader fails to load a class, it lets its parent
>> > classloader
>> > >> try.  If its parent succeeds, its parent becomes the "classloader that
>> > >> loaded it".
>> > >>
>> > >> So suppose class B is in a secondary app jar and class A is in the
>> > primary
>> > >> app jar:
>> > >> 1. The custom classloader will try to load class A.
>> > >> 2. It will fail, because it only knows about the secondary jars.
>> > >> 3. It will delegate to its parent, the system classloader.
>> > >> 4. The system classloader will succeed, because it knows about the
>> > primary
>> > >> app jar.
>> > >> 5. A's classloader will be the system classloader.
>> > >> 6. A tries to instantiate an instance of class B.
>> > >> 7. B will be loaded with A's classloader, which is the system
>> > classloader.
>> > >> 8. Loading B will fail, because A's classloader, which is the system
>> > >> classloader, doesn't know about the secondary app jars.
>> > >>
>> > >> In Spark standalone, A and B are both loaded by the custom
>> classloader,
>> > so
>> > >> this issue doesn't come up.
>> > >>
>> > >> -Sandy
>> > >>
>> > >> On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell <pw...@gmail.com>
>> > >> wrote:
>> > >>
>> > >> > Having a user add define a custom class inside of an added jar and
>> > >> > instantiate it directly inside of an executor is definitely
>> supported
>> > >> > in Spark and has been for a really long time (several years). This
>> is
>> > >> > something we do all the time in Spark.
>> > >> >
>> > >> > DB - I'd hold off on a re-architecting of this until we identify
>> > >> > exactly what is causing the bug you are running into.
>> > >> >
>> > >> > In a nutshell, when the bytecode "new Foo()" is run on the executor,
>> > >> > it will ask the driver for the class over HTTP using a custom
>> > >> > classloader. Something in that pipeline is breaking here, possibly
>> > >> > related to the YARN deployment stuff.
>> > >> >
>> > >> >
>> > >> > On Mon, May 19, 2014 at 12:29 AM, Sean Owen <so...@cloudera.com>
>> > wrote:
>> > >> > > I don't think a customer classloader is necessary.
>> > >> > >
>> > >> > > Well, it occurs to me that this is no new problem. Hadoop, Tomcat,
>> > etc
>> > >> > > all run custom user code that creates new user objects without
>> > >> > > reflection. I should go see how that's done. Maybe it's totally
>> > valid
>> > >> > > to set the thread's context classloader for just this purpose,
>> and I
>> > >> > > am not thinking clearly.
>> > >> > >
>> > >> > > On Mon, May 19, 2014 at 8:26 AM, Andrew Ash <andrew@andrewash.com
>> >
>> > >> > wrote:
>> > >> > >> Sounds like the problem is that classloaders always look in their
>> > >> > parents
>> > >> > >> before themselves, and Spark users want executors to pick up
>> > classes
>> > >> > from
>> > >> > >> their custom code before the ones in Spark plus its dependencies.
>> > >> > >>
>> > >> > >> Would a custom classloader that delegates to the parent after
>> first
>> > >> > >> checking itself fix this up?
>> > >> > >>
>> > >> > >>
>> > >> > >> On Mon, May 19, 2014 at 12:17 AM, DB Tsai <db...@stanford.edu>
>> > >> wrote:
>> > >> > >>
>> > >> > >>> Hi Sean,
>> > >> > >>>
>> > >> > >>> It's true that the issue here is classloader, and due to the
>> > >> > classloader
>> > >> > >>> delegation model, users have to use reflection in the executors
>> to
>> > >> > pick up
>> > >> > >>> the classloader in order to use those classes added by
>> sc.addJars
>> > >> APIs.
>> > >> > >>> However, it's very inconvenience for users, and not documented
>> in
>> > >> > spark.
>> > >> > >>>
>> > >> > >>> I'm working on a patch to solve it by calling the protected
>> method
>> > >> > addURL
>> > >> > >>> in URLClassLoader to update the current default classloader, so
>> no
>> > >> > >>> customClassLoader anymore. I wonder if this is an good way to
>> go.
>> > >> > >>>
>> > >> > >>>   private def addURL(url: URL, loader: URLClassLoader){
>> > >> > >>>     try {
>> > >> > >>>       val method: Method =
>> > >> > >>> classOf[URLClassLoader].getDeclaredMethod("addURL",
>> classOf[URL])
>> > >> > >>>       method.setAccessible(true)
>> > >> > >>>       method.invoke(loader, url)
>> > >> > >>>     }
>> > >> > >>>     catch {
>> > >> > >>>       case t: Throwable => {
>> > >> > >>>         throw new IOException("Error, could not add URL to
>> system
>> > >> > >>> classloader")
>> > >> > >>>       }
>> > >> > >>>     }
>> > >> > >>>   }
>> > >> > >>>
>> > >> > >>>
>> > >> > >>>
>> > >> > >>> Sincerely,
>> > >> > >>>
>> > >> > >>> DB Tsai
>> > >> > >>> -------------------------------------------------------
>> > >> > >>> My Blog: https://www.dbtsai.com
>> > >> > >>> LinkedIn: https://www.linkedin.com/in/dbtsai
>> > >> > >>>
>> > >> > >>>
>> > >> > >>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen <sowen@cloudera.com
>> >
>> > >> > wrote:
>> > >> > >>>
>> > >> > >>> > I might be stating the obvious for everyone, but the issue
>> here
>> > is
>> > >> > not
>> > >> > >>> > reflection or the source of the JAR, but the ClassLoader. The
>> > >> basic
>> > >> > >>> > rules are this.
>> > >> > >>> >
>> > >> > >>> > "new Foo" will use the ClassLoader that defines Foo. This is
>> > >> usually
>> > >> > >>> > the ClassLoader that loaded whatever it is that first
>> referenced
>> > >> Foo
>> > >> > >>> > and caused it to be loaded -- usually the ClassLoader holding
>> > your
>> > >> > >>> > other app classes.
>> > >> > >>> >
>> > >> > >>> > ClassLoaders can have a parent-child relationship.
>> ClassLoaders
>> > >> > always
>> > >> > >>> > look in their parent before themselves.
>> > >> > >>> >
>> > >> > >>> > (Careful then -- in contexts like Hadoop or Tomcat where your
>> > app
>> > >> is
>> > >> > >>> > loaded in a child ClassLoader, and you reference a class that
>> > >> Hadoop
>> > >> > >>> > or Tomcat also has (like a lib class) you will get the
>> > container's
>> > >> > >>> > version!)
>> > >> > >>> >
>> > >> > >>> > When you load an external JAR it has a separate ClassLoader
>> > which
>> > >> > does
>> > >> > >>> > not necessarily bear any relation to the one containing your
>> app
>> > >> > >>> > classes, so yeah it is not generally going to make "new Foo"
>> > work.
>> > >> > >>> >
>> > >> > >>> > Reflection lets you pick the ClassLoader, yes.
>> > >> > >>> >
>> > >> > >>> > I would not call setContextClassLoader.
>> > >> > >>> >
>> > >> > >>> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza <
>> > >> > sandy.ryza@cloudera.com>
>> > >> > >>> > wrote:
>> > >> > >>> > > I spoke with DB offline about this a little while ago and he
>> > >> > confirmed
>> > >> > >>> > that
>> > >> > >>> > > he was able to access the jar from the driver.
>> > >> > >>> > >
>> > >> > >>> > > The issue appears to be a general Java issue: you can't
>> > directly
>> > >> > >>> > > instantiate a class from a dynamically loaded jar.
>> > >> > >>> > >
>> > >> > >>> > > I reproduced it locally outside of Spark with:
>> > >> > >>> > > ---
>> > >> > >>> > >     URLClassLoader urlClassLoader = new URLClassLoader(new
>> > >> URL[] {
>> > >> > new
>> > >> > >>> > > File("myotherjar.jar").toURI().toURL() }, null);
>> > >> > >>> > >
>> > >> Thread.currentThread().setContextClassLoader(urlClassLoader);
>> > >> > >>> > >     MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
>> > >> > >>> > > ---
>> > >> > >>> > >
>> > >> > >>> > > I was able to load the class with reflection.
>> > >> > >>> >
>> > >> > >>>
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by DB Tsai <db...@stanford.edu>.
@Xiangrui
How about we send the primary jar and secondary jars into distributed cache
without adding them into the system classloader of executors. Then we add
them using custom classloader so we don't need to call secondary jars
through reflection in primary jar. This will be consistent to what we do in
standalone mode, and also solve the scalability of jar distribution issue.

@Koert
Yes, that's why I suggest we can either ignore the parent classloader of
custom class loader to solve this as you say. In this case, we need add the
all the classpath of the system loader into our custom one (which doesn't
have parent) so we will not miss the default java classes. This is how
tomcat works.

@Patrick
I agree that we should have the fix by Xiangrui first, since it solves most
of the use case. I don't know when people will use dynamical addJar in Yarn
since it's most useful for interactive environment.


Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Wed, May 21, 2014 at 2:57 PM, Koert Kuipers <ko...@tresata.com> wrote:

> db tsai, i do not think userClassPathFirst is working, unless the classes
> you load dont reference any classes already loaded by the parent
> classloader (a mostly hypothetical situation)... i filed a jira for this
> here:
> https://issues.apache.org/jira/browse/SPARK-1863
>
>
>
> On Tue, May 20, 2014 at 1:04 AM, DB Tsai <db...@stanford.edu> wrote:
>
> > In 1.0, there is a new option for users to choose which classloader has
> > higher priority via spark.files.userClassPathFirst, I decided to submit
> the
> > PR for 0.9 first. We use this patch in our lab and we can use those jars
> > added by sc.addJar without reflection.
> >
> > https://github.com/apache/spark/pull/834
> >
> > Can anyone comment if it's a good approach?
> >
> > Thanks.
> >
> >
> > Sincerely,
> >
> > DB Tsai
> > -------------------------------------------------------
> > My Blog: https://www.dbtsai.com
> > LinkedIn: https://www.linkedin.com/in/dbtsai
> >
> >
> > On Mon, May 19, 2014 at 7:42 PM, DB Tsai <db...@stanford.edu> wrote:
> >
> > > Good summary! We fixed it in branch 0.9 since our production is still
> in
> > > 0.9. I'm porting it to 1.0 now, and hopefully will submit PR for 1.0
> > > tonight.
> > >
> > >
> > > Sincerely,
> > >
> > > DB Tsai
> > > -------------------------------------------------------
> > > My Blog: https://www.dbtsai.com
> > > LinkedIn: https://www.linkedin.com/in/dbtsai
> > >
> > >
> > > On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza <sandy.ryza@cloudera.com
> > >wrote:
> > >
> > >> It just hit me why this problem is showing up on YARN and not on
> > >> standalone.
> > >>
> > >> The relevant difference between YARN and standalone is that, on YARN,
> > the
> > >> app jar is loaded by the system classloader instead of Spark's custom
> > URL
> > >> classloader.
> > >>
> > >> On YARN, the system classloader knows about [the classes in the spark
> > >> jars,
> > >> the classes in the primary app jar].   The custom classloader knows
> > about
> > >> [the classes in secondary app jars] and has the system classloader as
> > its
> > >> parent.
> > >>
> > >> A few relevant facts (mostly redundant with what Sean pointed out):
> > >> * Every class has a classloader that loaded it.
> > >> * When an object of class B is instantiated inside of class A, the
> > >> classloader used for loading B is the classloader that was used for
> > >> loading
> > >> A.
> > >> * When a classloader fails to load a class, it lets its parent
> > classloader
> > >> try.  If its parent succeeds, its parent becomes the "classloader that
> > >> loaded it".
> > >>
> > >> So suppose class B is in a secondary app jar and class A is in the
> > primary
> > >> app jar:
> > >> 1. The custom classloader will try to load class A.
> > >> 2. It will fail, because it only knows about the secondary jars.
> > >> 3. It will delegate to its parent, the system classloader.
> > >> 4. The system classloader will succeed, because it knows about the
> > primary
> > >> app jar.
> > >> 5. A's classloader will be the system classloader.
> > >> 6. A tries to instantiate an instance of class B.
> > >> 7. B will be loaded with A's classloader, which is the system
> > classloader.
> > >> 8. Loading B will fail, because A's classloader, which is the system
> > >> classloader, doesn't know about the secondary app jars.
> > >>
> > >> In Spark standalone, A and B are both loaded by the custom
> classloader,
> > so
> > >> this issue doesn't come up.
> > >>
> > >> -Sandy
> > >>
> > >> On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell <pw...@gmail.com>
> > >> wrote:
> > >>
> > >> > Having a user add define a custom class inside of an added jar and
> > >> > instantiate it directly inside of an executor is definitely
> supported
> > >> > in Spark and has been for a really long time (several years). This
> is
> > >> > something we do all the time in Spark.
> > >> >
> > >> > DB - I'd hold off on a re-architecting of this until we identify
> > >> > exactly what is causing the bug you are running into.
> > >> >
> > >> > In a nutshell, when the bytecode "new Foo()" is run on the executor,
> > >> > it will ask the driver for the class over HTTP using a custom
> > >> > classloader. Something in that pipeline is breaking here, possibly
> > >> > related to the YARN deployment stuff.
> > >> >
> > >> >
> > >> > On Mon, May 19, 2014 at 12:29 AM, Sean Owen <so...@cloudera.com>
> > wrote:
> > >> > > I don't think a customer classloader is necessary.
> > >> > >
> > >> > > Well, it occurs to me that this is no new problem. Hadoop, Tomcat,
> > etc
> > >> > > all run custom user code that creates new user objects without
> > >> > > reflection. I should go see how that's done. Maybe it's totally
> > valid
> > >> > > to set the thread's context classloader for just this purpose,
> and I
> > >> > > am not thinking clearly.
> > >> > >
> > >> > > On Mon, May 19, 2014 at 8:26 AM, Andrew Ash <andrew@andrewash.com
> >
> > >> > wrote:
> > >> > >> Sounds like the problem is that classloaders always look in their
> > >> > parents
> > >> > >> before themselves, and Spark users want executors to pick up
> > classes
> > >> > from
> > >> > >> their custom code before the ones in Spark plus its dependencies.
> > >> > >>
> > >> > >> Would a custom classloader that delegates to the parent after
> first
> > >> > >> checking itself fix this up?
> > >> > >>
> > >> > >>
> > >> > >> On Mon, May 19, 2014 at 12:17 AM, DB Tsai <db...@stanford.edu>
> > >> wrote:
> > >> > >>
> > >> > >>> Hi Sean,
> > >> > >>>
> > >> > >>> It's true that the issue here is classloader, and due to the
> > >> > classloader
> > >> > >>> delegation model, users have to use reflection in the executors
> to
> > >> > pick up
> > >> > >>> the classloader in order to use those classes added by
> sc.addJars
> > >> APIs.
> > >> > >>> However, it's very inconvenience for users, and not documented
> in
> > >> > spark.
> > >> > >>>
> > >> > >>> I'm working on a patch to solve it by calling the protected
> method
> > >> > addURL
> > >> > >>> in URLClassLoader to update the current default classloader, so
> no
> > >> > >>> customClassLoader anymore. I wonder if this is an good way to
> go.
> > >> > >>>
> > >> > >>>   private def addURL(url: URL, loader: URLClassLoader){
> > >> > >>>     try {
> > >> > >>>       val method: Method =
> > >> > >>> classOf[URLClassLoader].getDeclaredMethod("addURL",
> classOf[URL])
> > >> > >>>       method.setAccessible(true)
> > >> > >>>       method.invoke(loader, url)
> > >> > >>>     }
> > >> > >>>     catch {
> > >> > >>>       case t: Throwable => {
> > >> > >>>         throw new IOException("Error, could not add URL to
> system
> > >> > >>> classloader")
> > >> > >>>       }
> > >> > >>>     }
> > >> > >>>   }
> > >> > >>>
> > >> > >>>
> > >> > >>>
> > >> > >>> Sincerely,
> > >> > >>>
> > >> > >>> DB Tsai
> > >> > >>> -------------------------------------------------------
> > >> > >>> My Blog: https://www.dbtsai.com
> > >> > >>> LinkedIn: https://www.linkedin.com/in/dbtsai
> > >> > >>>
> > >> > >>>
> > >> > >>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen <sowen@cloudera.com
> >
> > >> > wrote:
> > >> > >>>
> > >> > >>> > I might be stating the obvious for everyone, but the issue
> here
> > is
> > >> > not
> > >> > >>> > reflection or the source of the JAR, but the ClassLoader. The
> > >> basic
> > >> > >>> > rules are this.
> > >> > >>> >
> > >> > >>> > "new Foo" will use the ClassLoader that defines Foo. This is
> > >> usually
> > >> > >>> > the ClassLoader that loaded whatever it is that first
> referenced
> > >> Foo
> > >> > >>> > and caused it to be loaded -- usually the ClassLoader holding
> > your
> > >> > >>> > other app classes.
> > >> > >>> >
> > >> > >>> > ClassLoaders can have a parent-child relationship.
> ClassLoaders
> > >> > always
> > >> > >>> > look in their parent before themselves.
> > >> > >>> >
> > >> > >>> > (Careful then -- in contexts like Hadoop or Tomcat where your
> > app
> > >> is
> > >> > >>> > loaded in a child ClassLoader, and you reference a class that
> > >> Hadoop
> > >> > >>> > or Tomcat also has (like a lib class) you will get the
> > container's
> > >> > >>> > version!)
> > >> > >>> >
> > >> > >>> > When you load an external JAR it has a separate ClassLoader
> > which
> > >> > does
> > >> > >>> > not necessarily bear any relation to the one containing your
> app
> > >> > >>> > classes, so yeah it is not generally going to make "new Foo"
> > work.
> > >> > >>> >
> > >> > >>> > Reflection lets you pick the ClassLoader, yes.
> > >> > >>> >
> > >> > >>> > I would not call setContextClassLoader.
> > >> > >>> >
> > >> > >>> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza <
> > >> > sandy.ryza@cloudera.com>
> > >> > >>> > wrote:
> > >> > >>> > > I spoke with DB offline about this a little while ago and he
> > >> > confirmed
> > >> > >>> > that
> > >> > >>> > > he was able to access the jar from the driver.
> > >> > >>> > >
> > >> > >>> > > The issue appears to be a general Java issue: you can't
> > directly
> > >> > >>> > > instantiate a class from a dynamically loaded jar.
> > >> > >>> > >
> > >> > >>> > > I reproduced it locally outside of Spark with:
> > >> > >>> > > ---
> > >> > >>> > >     URLClassLoader urlClassLoader = new URLClassLoader(new
> > >> URL[] {
> > >> > new
> > >> > >>> > > File("myotherjar.jar").toURI().toURL() }, null);
> > >> > >>> > >
> > >> Thread.currentThread().setContextClassLoader(urlClassLoader);
> > >> > >>> > >     MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
> > >> > >>> > > ---
> > >> > >>> > >
> > >> > >>> > > I was able to load the class with reflection.
> > >> > >>> >
> > >> > >>>
> > >> >
> > >>
> > >
> > >
> >
>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by Koert Kuipers <ko...@tresata.com>.
db tsai, i do not think userClassPathFirst is working, unless the classes
you load dont reference any classes already loaded by the parent
classloader (a mostly hypothetical situation)... i filed a jira for this
here:
https://issues.apache.org/jira/browse/SPARK-1863



On Tue, May 20, 2014 at 1:04 AM, DB Tsai <db...@stanford.edu> wrote:

> In 1.0, there is a new option for users to choose which classloader has
> higher priority via spark.files.userClassPathFirst, I decided to submit the
> PR for 0.9 first. We use this patch in our lab and we can use those jars
> added by sc.addJar without reflection.
>
> https://github.com/apache/spark/pull/834
>
> Can anyone comment if it's a good approach?
>
> Thanks.
>
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Mon, May 19, 2014 at 7:42 PM, DB Tsai <db...@stanford.edu> wrote:
>
> > Good summary! We fixed it in branch 0.9 since our production is still in
> > 0.9. I'm porting it to 1.0 now, and hopefully will submit PR for 1.0
> > tonight.
> >
> >
> > Sincerely,
> >
> > DB Tsai
> > -------------------------------------------------------
> > My Blog: https://www.dbtsai.com
> > LinkedIn: https://www.linkedin.com/in/dbtsai
> >
> >
> > On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza <sandy.ryza@cloudera.com
> >wrote:
> >
> >> It just hit me why this problem is showing up on YARN and not on
> >> standalone.
> >>
> >> The relevant difference between YARN and standalone is that, on YARN,
> the
> >> app jar is loaded by the system classloader instead of Spark's custom
> URL
> >> classloader.
> >>
> >> On YARN, the system classloader knows about [the classes in the spark
> >> jars,
> >> the classes in the primary app jar].   The custom classloader knows
> about
> >> [the classes in secondary app jars] and has the system classloader as
> its
> >> parent.
> >>
> >> A few relevant facts (mostly redundant with what Sean pointed out):
> >> * Every class has a classloader that loaded it.
> >> * When an object of class B is instantiated inside of class A, the
> >> classloader used for loading B is the classloader that was used for
> >> loading
> >> A.
> >> * When a classloader fails to load a class, it lets its parent
> classloader
> >> try.  If its parent succeeds, its parent becomes the "classloader that
> >> loaded it".
> >>
> >> So suppose class B is in a secondary app jar and class A is in the
> primary
> >> app jar:
> >> 1. The custom classloader will try to load class A.
> >> 2. It will fail, because it only knows about the secondary jars.
> >> 3. It will delegate to its parent, the system classloader.
> >> 4. The system classloader will succeed, because it knows about the
> primary
> >> app jar.
> >> 5. A's classloader will be the system classloader.
> >> 6. A tries to instantiate an instance of class B.
> >> 7. B will be loaded with A's classloader, which is the system
> classloader.
> >> 8. Loading B will fail, because A's classloader, which is the system
> >> classloader, doesn't know about the secondary app jars.
> >>
> >> In Spark standalone, A and B are both loaded by the custom classloader,
> so
> >> this issue doesn't come up.
> >>
> >> -Sandy
> >>
> >> On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell <pw...@gmail.com>
> >> wrote:
> >>
> >> > Having a user add define a custom class inside of an added jar and
> >> > instantiate it directly inside of an executor is definitely supported
> >> > in Spark and has been for a really long time (several years). This is
> >> > something we do all the time in Spark.
> >> >
> >> > DB - I'd hold off on a re-architecting of this until we identify
> >> > exactly what is causing the bug you are running into.
> >> >
> >> > In a nutshell, when the bytecode "new Foo()" is run on the executor,
> >> > it will ask the driver for the class over HTTP using a custom
> >> > classloader. Something in that pipeline is breaking here, possibly
> >> > related to the YARN deployment stuff.
> >> >
> >> >
> >> > On Mon, May 19, 2014 at 12:29 AM, Sean Owen <so...@cloudera.com>
> wrote:
> >> > > I don't think a customer classloader is necessary.
> >> > >
> >> > > Well, it occurs to me that this is no new problem. Hadoop, Tomcat,
> etc
> >> > > all run custom user code that creates new user objects without
> >> > > reflection. I should go see how that's done. Maybe it's totally
> valid
> >> > > to set the thread's context classloader for just this purpose, and I
> >> > > am not thinking clearly.
> >> > >
> >> > > On Mon, May 19, 2014 at 8:26 AM, Andrew Ash <an...@andrewash.com>
> >> > wrote:
> >> > >> Sounds like the problem is that classloaders always look in their
> >> > parents
> >> > >> before themselves, and Spark users want executors to pick up
> classes
> >> > from
> >> > >> their custom code before the ones in Spark plus its dependencies.
> >> > >>
> >> > >> Would a custom classloader that delegates to the parent after first
> >> > >> checking itself fix this up?
> >> > >>
> >> > >>
> >> > >> On Mon, May 19, 2014 at 12:17 AM, DB Tsai <db...@stanford.edu>
> >> wrote:
> >> > >>
> >> > >>> Hi Sean,
> >> > >>>
> >> > >>> It's true that the issue here is classloader, and due to the
> >> > classloader
> >> > >>> delegation model, users have to use reflection in the executors to
> >> > pick up
> >> > >>> the classloader in order to use those classes added by sc.addJars
> >> APIs.
> >> > >>> However, it's very inconvenience for users, and not documented in
> >> > spark.
> >> > >>>
> >> > >>> I'm working on a patch to solve it by calling the protected method
> >> > addURL
> >> > >>> in URLClassLoader to update the current default classloader, so no
> >> > >>> customClassLoader anymore. I wonder if this is an good way to go.
> >> > >>>
> >> > >>>   private def addURL(url: URL, loader: URLClassLoader){
> >> > >>>     try {
> >> > >>>       val method: Method =
> >> > >>> classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
> >> > >>>       method.setAccessible(true)
> >> > >>>       method.invoke(loader, url)
> >> > >>>     }
> >> > >>>     catch {
> >> > >>>       case t: Throwable => {
> >> > >>>         throw new IOException("Error, could not add URL to system
> >> > >>> classloader")
> >> > >>>       }
> >> > >>>     }
> >> > >>>   }
> >> > >>>
> >> > >>>
> >> > >>>
> >> > >>> Sincerely,
> >> > >>>
> >> > >>> DB Tsai
> >> > >>> -------------------------------------------------------
> >> > >>> My Blog: https://www.dbtsai.com
> >> > >>> LinkedIn: https://www.linkedin.com/in/dbtsai
> >> > >>>
> >> > >>>
> >> > >>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen <so...@cloudera.com>
> >> > wrote:
> >> > >>>
> >> > >>> > I might be stating the obvious for everyone, but the issue here
> is
> >> > not
> >> > >>> > reflection or the source of the JAR, but the ClassLoader. The
> >> basic
> >> > >>> > rules are this.
> >> > >>> >
> >> > >>> > "new Foo" will use the ClassLoader that defines Foo. This is
> >> usually
> >> > >>> > the ClassLoader that loaded whatever it is that first referenced
> >> Foo
> >> > >>> > and caused it to be loaded -- usually the ClassLoader holding
> your
> >> > >>> > other app classes.
> >> > >>> >
> >> > >>> > ClassLoaders can have a parent-child relationship. ClassLoaders
> >> > always
> >> > >>> > look in their parent before themselves.
> >> > >>> >
> >> > >>> > (Careful then -- in contexts like Hadoop or Tomcat where your
> app
> >> is
> >> > >>> > loaded in a child ClassLoader, and you reference a class that
> >> Hadoop
> >> > >>> > or Tomcat also has (like a lib class) you will get the
> container's
> >> > >>> > version!)
> >> > >>> >
> >> > >>> > When you load an external JAR it has a separate ClassLoader
> which
> >> > does
> >> > >>> > not necessarily bear any relation to the one containing your app
> >> > >>> > classes, so yeah it is not generally going to make "new Foo"
> work.
> >> > >>> >
> >> > >>> > Reflection lets you pick the ClassLoader, yes.
> >> > >>> >
> >> > >>> > I would not call setContextClassLoader.
> >> > >>> >
> >> > >>> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza <
> >> > sandy.ryza@cloudera.com>
> >> > >>> > wrote:
> >> > >>> > > I spoke with DB offline about this a little while ago and he
> >> > confirmed
> >> > >>> > that
> >> > >>> > > he was able to access the jar from the driver.
> >> > >>> > >
> >> > >>> > > The issue appears to be a general Java issue: you can't
> directly
> >> > >>> > > instantiate a class from a dynamically loaded jar.
> >> > >>> > >
> >> > >>> > > I reproduced it locally outside of Spark with:
> >> > >>> > > ---
> >> > >>> > >     URLClassLoader urlClassLoader = new URLClassLoader(new
> >> URL[] {
> >> > new
> >> > >>> > > File("myotherjar.jar").toURI().toURL() }, null);
> >> > >>> > >
> >> Thread.currentThread().setContextClassLoader(urlClassLoader);
> >> > >>> > >     MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
> >> > >>> > > ---
> >> > >>> > >
> >> > >>> > > I was able to load the class with reflection.
> >> > >>> >
> >> > >>>
> >> >
> >>
> >
> >
>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by DB Tsai <db...@stanford.edu>.
In 1.0, there is a new option for users to choose which classloader has
higher priority via spark.files.userClassPathFirst, I decided to submit the
PR for 0.9 first. We use this patch in our lab and we can use those jars
added by sc.addJar without reflection.

https://github.com/apache/spark/pull/834

Can anyone comment if it's a good approach?

Thanks.


Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Mon, May 19, 2014 at 7:42 PM, DB Tsai <db...@stanford.edu> wrote:

> Good summary! We fixed it in branch 0.9 since our production is still in
> 0.9. I'm porting it to 1.0 now, and hopefully will submit PR for 1.0
> tonight.
>
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza <sa...@cloudera.com>wrote:
>
>> It just hit me why this problem is showing up on YARN and not on
>> standalone.
>>
>> The relevant difference between YARN and standalone is that, on YARN, the
>> app jar is loaded by the system classloader instead of Spark's custom URL
>> classloader.
>>
>> On YARN, the system classloader knows about [the classes in the spark
>> jars,
>> the classes in the primary app jar].   The custom classloader knows about
>> [the classes in secondary app jars] and has the system classloader as its
>> parent.
>>
>> A few relevant facts (mostly redundant with what Sean pointed out):
>> * Every class has a classloader that loaded it.
>> * When an object of class B is instantiated inside of class A, the
>> classloader used for loading B is the classloader that was used for
>> loading
>> A.
>> * When a classloader fails to load a class, it lets its parent classloader
>> try.  If its parent succeeds, its parent becomes the "classloader that
>> loaded it".
>>
>> So suppose class B is in a secondary app jar and class A is in the primary
>> app jar:
>> 1. The custom classloader will try to load class A.
>> 2. It will fail, because it only knows about the secondary jars.
>> 3. It will delegate to its parent, the system classloader.
>> 4. The system classloader will succeed, because it knows about the primary
>> app jar.
>> 5. A's classloader will be the system classloader.
>> 6. A tries to instantiate an instance of class B.
>> 7. B will be loaded with A's classloader, which is the system classloader.
>> 8. Loading B will fail, because A's classloader, which is the system
>> classloader, doesn't know about the secondary app jars.
>>
>> In Spark standalone, A and B are both loaded by the custom classloader, so
>> this issue doesn't come up.
>>
>> -Sandy
>>
>> On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell <pw...@gmail.com>
>> wrote:
>>
>> > Having a user add define a custom class inside of an added jar and
>> > instantiate it directly inside of an executor is definitely supported
>> > in Spark and has been for a really long time (several years). This is
>> > something we do all the time in Spark.
>> >
>> > DB - I'd hold off on a re-architecting of this until we identify
>> > exactly what is causing the bug you are running into.
>> >
>> > In a nutshell, when the bytecode "new Foo()" is run on the executor,
>> > it will ask the driver for the class over HTTP using a custom
>> > classloader. Something in that pipeline is breaking here, possibly
>> > related to the YARN deployment stuff.
>> >
>> >
>> > On Mon, May 19, 2014 at 12:29 AM, Sean Owen <so...@cloudera.com> wrote:
>> > > I don't think a customer classloader is necessary.
>> > >
>> > > Well, it occurs to me that this is no new problem. Hadoop, Tomcat, etc
>> > > all run custom user code that creates new user objects without
>> > > reflection. I should go see how that's done. Maybe it's totally valid
>> > > to set the thread's context classloader for just this purpose, and I
>> > > am not thinking clearly.
>> > >
>> > > On Mon, May 19, 2014 at 8:26 AM, Andrew Ash <an...@andrewash.com>
>> > wrote:
>> > >> Sounds like the problem is that classloaders always look in their
>> > parents
>> > >> before themselves, and Spark users want executors to pick up classes
>> > from
>> > >> their custom code before the ones in Spark plus its dependencies.
>> > >>
>> > >> Would a custom classloader that delegates to the parent after first
>> > >> checking itself fix this up?
>> > >>
>> > >>
>> > >> On Mon, May 19, 2014 at 12:17 AM, DB Tsai <db...@stanford.edu>
>> wrote:
>> > >>
>> > >>> Hi Sean,
>> > >>>
>> > >>> It's true that the issue here is classloader, and due to the
>> > classloader
>> > >>> delegation model, users have to use reflection in the executors to
>> > pick up
>> > >>> the classloader in order to use those classes added by sc.addJars
>> APIs.
>> > >>> However, it's very inconvenience for users, and not documented in
>> > spark.
>> > >>>
>> > >>> I'm working on a patch to solve it by calling the protected method
>> > addURL
>> > >>> in URLClassLoader to update the current default classloader, so no
>> > >>> customClassLoader anymore. I wonder if this is an good way to go.
>> > >>>
>> > >>>   private def addURL(url: URL, loader: URLClassLoader){
>> > >>>     try {
>> > >>>       val method: Method =
>> > >>> classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
>> > >>>       method.setAccessible(true)
>> > >>>       method.invoke(loader, url)
>> > >>>     }
>> > >>>     catch {
>> > >>>       case t: Throwable => {
>> > >>>         throw new IOException("Error, could not add URL to system
>> > >>> classloader")
>> > >>>       }
>> > >>>     }
>> > >>>   }
>> > >>>
>> > >>>
>> > >>>
>> > >>> Sincerely,
>> > >>>
>> > >>> DB Tsai
>> > >>> -------------------------------------------------------
>> > >>> My Blog: https://www.dbtsai.com
>> > >>> LinkedIn: https://www.linkedin.com/in/dbtsai
>> > >>>
>> > >>>
>> > >>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen <so...@cloudera.com>
>> > wrote:
>> > >>>
>> > >>> > I might be stating the obvious for everyone, but the issue here is
>> > not
>> > >>> > reflection or the source of the JAR, but the ClassLoader. The
>> basic
>> > >>> > rules are this.
>> > >>> >
>> > >>> > "new Foo" will use the ClassLoader that defines Foo. This is
>> usually
>> > >>> > the ClassLoader that loaded whatever it is that first referenced
>> Foo
>> > >>> > and caused it to be loaded -- usually the ClassLoader holding your
>> > >>> > other app classes.
>> > >>> >
>> > >>> > ClassLoaders can have a parent-child relationship. ClassLoaders
>> > always
>> > >>> > look in their parent before themselves.
>> > >>> >
>> > >>> > (Careful then -- in contexts like Hadoop or Tomcat where your app
>> is
>> > >>> > loaded in a child ClassLoader, and you reference a class that
>> Hadoop
>> > >>> > or Tomcat also has (like a lib class) you will get the container's
>> > >>> > version!)
>> > >>> >
>> > >>> > When you load an external JAR it has a separate ClassLoader which
>> > does
>> > >>> > not necessarily bear any relation to the one containing your app
>> > >>> > classes, so yeah it is not generally going to make "new Foo" work.
>> > >>> >
>> > >>> > Reflection lets you pick the ClassLoader, yes.
>> > >>> >
>> > >>> > I would not call setContextClassLoader.
>> > >>> >
>> > >>> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza <
>> > sandy.ryza@cloudera.com>
>> > >>> > wrote:
>> > >>> > > I spoke with DB offline about this a little while ago and he
>> > confirmed
>> > >>> > that
>> > >>> > > he was able to access the jar from the driver.
>> > >>> > >
>> > >>> > > The issue appears to be a general Java issue: you can't directly
>> > >>> > > instantiate a class from a dynamically loaded jar.
>> > >>> > >
>> > >>> > > I reproduced it locally outside of Spark with:
>> > >>> > > ---
>> > >>> > >     URLClassLoader urlClassLoader = new URLClassLoader(new
>> URL[] {
>> > new
>> > >>> > > File("myotherjar.jar").toURI().toURL() }, null);
>> > >>> > >
>> Thread.currentThread().setContextClassLoader(urlClassLoader);
>> > >>> > >     MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
>> > >>> > > ---
>> > >>> > >
>> > >>> > > I was able to load the class with reflection.
>> > >>> >
>> > >>>
>> >
>>
>
>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by DB Tsai <db...@stanford.edu>.
Good summary! We fixed it in branch 0.9 since our production is still in
0.9. I'm porting it to 1.0 now, and hopefully will submit PR for 1.0
tonight.


Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza <sa...@cloudera.com> wrote:

> It just hit me why this problem is showing up on YARN and not on
> standalone.
>
> The relevant difference between YARN and standalone is that, on YARN, the
> app jar is loaded by the system classloader instead of Spark's custom URL
> classloader.
>
> On YARN, the system classloader knows about [the classes in the spark jars,
> the classes in the primary app jar].   The custom classloader knows about
> [the classes in secondary app jars] and has the system classloader as its
> parent.
>
> A few relevant facts (mostly redundant with what Sean pointed out):
> * Every class has a classloader that loaded it.
> * When an object of class B is instantiated inside of class A, the
> classloader used for loading B is the classloader that was used for loading
> A.
> * When a classloader fails to load a class, it lets its parent classloader
> try.  If its parent succeeds, its parent becomes the "classloader that
> loaded it".
>
> So suppose class B is in a secondary app jar and class A is in the primary
> app jar:
> 1. The custom classloader will try to load class A.
> 2. It will fail, because it only knows about the secondary jars.
> 3. It will delegate to its parent, the system classloader.
> 4. The system classloader will succeed, because it knows about the primary
> app jar.
> 5. A's classloader will be the system classloader.
> 6. A tries to instantiate an instance of class B.
> 7. B will be loaded with A's classloader, which is the system classloader.
> 8. Loading B will fail, because A's classloader, which is the system
> classloader, doesn't know about the secondary app jars.
>
> In Spark standalone, A and B are both loaded by the custom classloader, so
> this issue doesn't come up.
>
> -Sandy
>
> On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell <pw...@gmail.com>
> wrote:
>
> > Having a user add define a custom class inside of an added jar and
> > instantiate it directly inside of an executor is definitely supported
> > in Spark and has been for a really long time (several years). This is
> > something we do all the time in Spark.
> >
> > DB - I'd hold off on a re-architecting of this until we identify
> > exactly what is causing the bug you are running into.
> >
> > In a nutshell, when the bytecode "new Foo()" is run on the executor,
> > it will ask the driver for the class over HTTP using a custom
> > classloader. Something in that pipeline is breaking here, possibly
> > related to the YARN deployment stuff.
> >
> >
> > On Mon, May 19, 2014 at 12:29 AM, Sean Owen <so...@cloudera.com> wrote:
> > > I don't think a customer classloader is necessary.
> > >
> > > Well, it occurs to me that this is no new problem. Hadoop, Tomcat, etc
> > > all run custom user code that creates new user objects without
> > > reflection. I should go see how that's done. Maybe it's totally valid
> > > to set the thread's context classloader for just this purpose, and I
> > > am not thinking clearly.
> > >
> > > On Mon, May 19, 2014 at 8:26 AM, Andrew Ash <an...@andrewash.com>
> > wrote:
> > >> Sounds like the problem is that classloaders always look in their
> > parents
> > >> before themselves, and Spark users want executors to pick up classes
> > from
> > >> their custom code before the ones in Spark plus its dependencies.
> > >>
> > >> Would a custom classloader that delegates to the parent after first
> > >> checking itself fix this up?
> > >>
> > >>
> > >> On Mon, May 19, 2014 at 12:17 AM, DB Tsai <db...@stanford.edu>
> wrote:
> > >>
> > >>> Hi Sean,
> > >>>
> > >>> It's true that the issue here is classloader, and due to the
> > classloader
> > >>> delegation model, users have to use reflection in the executors to
> > pick up
> > >>> the classloader in order to use those classes added by sc.addJars
> APIs.
> > >>> However, it's very inconvenience for users, and not documented in
> > spark.
> > >>>
> > >>> I'm working on a patch to solve it by calling the protected method
> > addURL
> > >>> in URLClassLoader to update the current default classloader, so no
> > >>> customClassLoader anymore. I wonder if this is an good way to go.
> > >>>
> > >>>   private def addURL(url: URL, loader: URLClassLoader){
> > >>>     try {
> > >>>       val method: Method =
> > >>> classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
> > >>>       method.setAccessible(true)
> > >>>       method.invoke(loader, url)
> > >>>     }
> > >>>     catch {
> > >>>       case t: Throwable => {
> > >>>         throw new IOException("Error, could not add URL to system
> > >>> classloader")
> > >>>       }
> > >>>     }
> > >>>   }
> > >>>
> > >>>
> > >>>
> > >>> Sincerely,
> > >>>
> > >>> DB Tsai
> > >>> -------------------------------------------------------
> > >>> My Blog: https://www.dbtsai.com
> > >>> LinkedIn: https://www.linkedin.com/in/dbtsai
> > >>>
> > >>>
> > >>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen <so...@cloudera.com>
> > wrote:
> > >>>
> > >>> > I might be stating the obvious for everyone, but the issue here is
> > not
> > >>> > reflection or the source of the JAR, but the ClassLoader. The basic
> > >>> > rules are this.
> > >>> >
> > >>> > "new Foo" will use the ClassLoader that defines Foo. This is
> usually
> > >>> > the ClassLoader that loaded whatever it is that first referenced
> Foo
> > >>> > and caused it to be loaded -- usually the ClassLoader holding your
> > >>> > other app classes.
> > >>> >
> > >>> > ClassLoaders can have a parent-child relationship. ClassLoaders
> > always
> > >>> > look in their parent before themselves.
> > >>> >
> > >>> > (Careful then -- in contexts like Hadoop or Tomcat where your app
> is
> > >>> > loaded in a child ClassLoader, and you reference a class that
> Hadoop
> > >>> > or Tomcat also has (like a lib class) you will get the container's
> > >>> > version!)
> > >>> >
> > >>> > When you load an external JAR it has a separate ClassLoader which
> > does
> > >>> > not necessarily bear any relation to the one containing your app
> > >>> > classes, so yeah it is not generally going to make "new Foo" work.
> > >>> >
> > >>> > Reflection lets you pick the ClassLoader, yes.
> > >>> >
> > >>> > I would not call setContextClassLoader.
> > >>> >
> > >>> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza <
> > sandy.ryza@cloudera.com>
> > >>> > wrote:
> > >>> > > I spoke with DB offline about this a little while ago and he
> > confirmed
> > >>> > that
> > >>> > > he was able to access the jar from the driver.
> > >>> > >
> > >>> > > The issue appears to be a general Java issue: you can't directly
> > >>> > > instantiate a class from a dynamically loaded jar.
> > >>> > >
> > >>> > > I reproduced it locally outside of Spark with:
> > >>> > > ---
> > >>> > >     URLClassLoader urlClassLoader = new URLClassLoader(new URL[]
> {
> > new
> > >>> > > File("myotherjar.jar").toURI().toURL() }, null);
> > >>> > >     Thread.currentThread().setContextClassLoader(urlClassLoader);
> > >>> > >     MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
> > >>> > > ---
> > >>> > >
> > >>> > > I was able to load the class with reflection.
> > >>> >
> > >>>
> >
>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by Sandy Ryza <sa...@cloudera.com>.
It just hit me why this problem is showing up on YARN and not on standalone.

The relevant difference between YARN and standalone is that, on YARN, the
app jar is loaded by the system classloader instead of Spark's custom URL
classloader.

On YARN, the system classloader knows about [the classes in the spark jars,
the classes in the primary app jar].   The custom classloader knows about
[the classes in secondary app jars] and has the system classloader as its
parent.

A few relevant facts (mostly redundant with what Sean pointed out):
* Every class has a classloader that loaded it.
* When an object of class B is instantiated inside of class A, the
classloader used for loading B is the classloader that was used for loading
A.
* When a classloader fails to load a class, it lets its parent classloader
try.  If its parent succeeds, its parent becomes the "classloader that
loaded it".

So suppose class B is in a secondary app jar and class A is in the primary
app jar:
1. The custom classloader will try to load class A.
2. It will fail, because it only knows about the secondary jars.
3. It will delegate to its parent, the system classloader.
4. The system classloader will succeed, because it knows about the primary
app jar.
5. A's classloader will be the system classloader.
6. A tries to instantiate an instance of class B.
7. B will be loaded with A's classloader, which is the system classloader.
8. Loading B will fail, because A's classloader, which is the system
classloader, doesn't know about the secondary app jars.

In Spark standalone, A and B are both loaded by the custom classloader, so
this issue doesn't come up.

-Sandy

On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell <pw...@gmail.com> wrote:

> Having a user add define a custom class inside of an added jar and
> instantiate it directly inside of an executor is definitely supported
> in Spark and has been for a really long time (several years). This is
> something we do all the time in Spark.
>
> DB - I'd hold off on a re-architecting of this until we identify
> exactly what is causing the bug you are running into.
>
> In a nutshell, when the bytecode "new Foo()" is run on the executor,
> it will ask the driver for the class over HTTP using a custom
> classloader. Something in that pipeline is breaking here, possibly
> related to the YARN deployment stuff.
>
>
> On Mon, May 19, 2014 at 12:29 AM, Sean Owen <so...@cloudera.com> wrote:
> > I don't think a customer classloader is necessary.
> >
> > Well, it occurs to me that this is no new problem. Hadoop, Tomcat, etc
> > all run custom user code that creates new user objects without
> > reflection. I should go see how that's done. Maybe it's totally valid
> > to set the thread's context classloader for just this purpose, and I
> > am not thinking clearly.
> >
> > On Mon, May 19, 2014 at 8:26 AM, Andrew Ash <an...@andrewash.com>
> wrote:
> >> Sounds like the problem is that classloaders always look in their
> parents
> >> before themselves, and Spark users want executors to pick up classes
> from
> >> their custom code before the ones in Spark plus its dependencies.
> >>
> >> Would a custom classloader that delegates to the parent after first
> >> checking itself fix this up?
> >>
> >>
> >> On Mon, May 19, 2014 at 12:17 AM, DB Tsai <db...@stanford.edu> wrote:
> >>
> >>> Hi Sean,
> >>>
> >>> It's true that the issue here is classloader, and due to the
> classloader
> >>> delegation model, users have to use reflection in the executors to
> pick up
> >>> the classloader in order to use those classes added by sc.addJars APIs.
> >>> However, it's very inconvenience for users, and not documented in
> spark.
> >>>
> >>> I'm working on a patch to solve it by calling the protected method
> addURL
> >>> in URLClassLoader to update the current default classloader, so no
> >>> customClassLoader anymore. I wonder if this is an good way to go.
> >>>
> >>>   private def addURL(url: URL, loader: URLClassLoader){
> >>>     try {
> >>>       val method: Method =
> >>> classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
> >>>       method.setAccessible(true)
> >>>       method.invoke(loader, url)
> >>>     }
> >>>     catch {
> >>>       case t: Throwable => {
> >>>         throw new IOException("Error, could not add URL to system
> >>> classloader")
> >>>       }
> >>>     }
> >>>   }
> >>>
> >>>
> >>>
> >>> Sincerely,
> >>>
> >>> DB Tsai
> >>> -------------------------------------------------------
> >>> My Blog: https://www.dbtsai.com
> >>> LinkedIn: https://www.linkedin.com/in/dbtsai
> >>>
> >>>
> >>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen <so...@cloudera.com>
> wrote:
> >>>
> >>> > I might be stating the obvious for everyone, but the issue here is
> not
> >>> > reflection or the source of the JAR, but the ClassLoader. The basic
> >>> > rules are this.
> >>> >
> >>> > "new Foo" will use the ClassLoader that defines Foo. This is usually
> >>> > the ClassLoader that loaded whatever it is that first referenced Foo
> >>> > and caused it to be loaded -- usually the ClassLoader holding your
> >>> > other app classes.
> >>> >
> >>> > ClassLoaders can have a parent-child relationship. ClassLoaders
> always
> >>> > look in their parent before themselves.
> >>> >
> >>> > (Careful then -- in contexts like Hadoop or Tomcat where your app is
> >>> > loaded in a child ClassLoader, and you reference a class that Hadoop
> >>> > or Tomcat also has (like a lib class) you will get the container's
> >>> > version!)
> >>> >
> >>> > When you load an external JAR it has a separate ClassLoader which
> does
> >>> > not necessarily bear any relation to the one containing your app
> >>> > classes, so yeah it is not generally going to make "new Foo" work.
> >>> >
> >>> > Reflection lets you pick the ClassLoader, yes.
> >>> >
> >>> > I would not call setContextClassLoader.
> >>> >
> >>> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza <
> sandy.ryza@cloudera.com>
> >>> > wrote:
> >>> > > I spoke with DB offline about this a little while ago and he
> confirmed
> >>> > that
> >>> > > he was able to access the jar from the driver.
> >>> > >
> >>> > > The issue appears to be a general Java issue: you can't directly
> >>> > > instantiate a class from a dynamically loaded jar.
> >>> > >
> >>> > > I reproduced it locally outside of Spark with:
> >>> > > ---
> >>> > >     URLClassLoader urlClassLoader = new URLClassLoader(new URL[] {
> new
> >>> > > File("myotherjar.jar").toURI().toURL() }, null);
> >>> > >     Thread.currentThread().setContextClassLoader(urlClassLoader);
> >>> > >     MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
> >>> > > ---
> >>> > >
> >>> > > I was able to load the class with reflection.
> >>> >
> >>>
>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by Patrick Wendell <pw...@gmail.com>.
Having a user add define a custom class inside of an added jar and
instantiate it directly inside of an executor is definitely supported
in Spark and has been for a really long time (several years). This is
something we do all the time in Spark.

DB - I'd hold off on a re-architecting of this until we identify
exactly what is causing the bug you are running into.

In a nutshell, when the bytecode "new Foo()" is run on the executor,
it will ask the driver for the class over HTTP using a custom
classloader. Something in that pipeline is breaking here, possibly
related to the YARN deployment stuff.


On Mon, May 19, 2014 at 12:29 AM, Sean Owen <so...@cloudera.com> wrote:
> I don't think a customer classloader is necessary.
>
> Well, it occurs to me that this is no new problem. Hadoop, Tomcat, etc
> all run custom user code that creates new user objects without
> reflection. I should go see how that's done. Maybe it's totally valid
> to set the thread's context classloader for just this purpose, and I
> am not thinking clearly.
>
> On Mon, May 19, 2014 at 8:26 AM, Andrew Ash <an...@andrewash.com> wrote:
>> Sounds like the problem is that classloaders always look in their parents
>> before themselves, and Spark users want executors to pick up classes from
>> their custom code before the ones in Spark plus its dependencies.
>>
>> Would a custom classloader that delegates to the parent after first
>> checking itself fix this up?
>>
>>
>> On Mon, May 19, 2014 at 12:17 AM, DB Tsai <db...@stanford.edu> wrote:
>>
>>> Hi Sean,
>>>
>>> It's true that the issue here is classloader, and due to the classloader
>>> delegation model, users have to use reflection in the executors to pick up
>>> the classloader in order to use those classes added by sc.addJars APIs.
>>> However, it's very inconvenience for users, and not documented in spark.
>>>
>>> I'm working on a patch to solve it by calling the protected method addURL
>>> in URLClassLoader to update the current default classloader, so no
>>> customClassLoader anymore. I wonder if this is an good way to go.
>>>
>>>   private def addURL(url: URL, loader: URLClassLoader){
>>>     try {
>>>       val method: Method =
>>> classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
>>>       method.setAccessible(true)
>>>       method.invoke(loader, url)
>>>     }
>>>     catch {
>>>       case t: Throwable => {
>>>         throw new IOException("Error, could not add URL to system
>>> classloader")
>>>       }
>>>     }
>>>   }
>>>
>>>
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> -------------------------------------------------------
>>> My Blog: https://www.dbtsai.com
>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>
>>>
>>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>> > I might be stating the obvious for everyone, but the issue here is not
>>> > reflection or the source of the JAR, but the ClassLoader. The basic
>>> > rules are this.
>>> >
>>> > "new Foo" will use the ClassLoader that defines Foo. This is usually
>>> > the ClassLoader that loaded whatever it is that first referenced Foo
>>> > and caused it to be loaded -- usually the ClassLoader holding your
>>> > other app classes.
>>> >
>>> > ClassLoaders can have a parent-child relationship. ClassLoaders always
>>> > look in their parent before themselves.
>>> >
>>> > (Careful then -- in contexts like Hadoop or Tomcat where your app is
>>> > loaded in a child ClassLoader, and you reference a class that Hadoop
>>> > or Tomcat also has (like a lib class) you will get the container's
>>> > version!)
>>> >
>>> > When you load an external JAR it has a separate ClassLoader which does
>>> > not necessarily bear any relation to the one containing your app
>>> > classes, so yeah it is not generally going to make "new Foo" work.
>>> >
>>> > Reflection lets you pick the ClassLoader, yes.
>>> >
>>> > I would not call setContextClassLoader.
>>> >
>>> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza <sa...@cloudera.com>
>>> > wrote:
>>> > > I spoke with DB offline about this a little while ago and he confirmed
>>> > that
>>> > > he was able to access the jar from the driver.
>>> > >
>>> > > The issue appears to be a general Java issue: you can't directly
>>> > > instantiate a class from a dynamically loaded jar.
>>> > >
>>> > > I reproduced it locally outside of Spark with:
>>> > > ---
>>> > >     URLClassLoader urlClassLoader = new URLClassLoader(new URL[] { new
>>> > > File("myotherjar.jar").toURI().toURL() }, null);
>>> > >     Thread.currentThread().setContextClassLoader(urlClassLoader);
>>> > >     MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
>>> > > ---
>>> > >
>>> > > I was able to load the class with reflection.
>>> >
>>>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by Sean Owen <so...@cloudera.com>.
I don't think a customer classloader is necessary.

Well, it occurs to me that this is no new problem. Hadoop, Tomcat, etc
all run custom user code that creates new user objects without
reflection. I should go see how that's done. Maybe it's totally valid
to set the thread's context classloader for just this purpose, and I
am not thinking clearly.

On Mon, May 19, 2014 at 8:26 AM, Andrew Ash <an...@andrewash.com> wrote:
> Sounds like the problem is that classloaders always look in their parents
> before themselves, and Spark users want executors to pick up classes from
> their custom code before the ones in Spark plus its dependencies.
>
> Would a custom classloader that delegates to the parent after first
> checking itself fix this up?
>
>
> On Mon, May 19, 2014 at 12:17 AM, DB Tsai <db...@stanford.edu> wrote:
>
>> Hi Sean,
>>
>> It's true that the issue here is classloader, and due to the classloader
>> delegation model, users have to use reflection in the executors to pick up
>> the classloader in order to use those classes added by sc.addJars APIs.
>> However, it's very inconvenience for users, and not documented in spark.
>>
>> I'm working on a patch to solve it by calling the protected method addURL
>> in URLClassLoader to update the current default classloader, so no
>> customClassLoader anymore. I wonder if this is an good way to go.
>>
>>   private def addURL(url: URL, loader: URLClassLoader){
>>     try {
>>       val method: Method =
>> classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
>>       method.setAccessible(true)
>>       method.invoke(loader, url)
>>     }
>>     catch {
>>       case t: Throwable => {
>>         throw new IOException("Error, could not add URL to system
>> classloader")
>>       }
>>     }
>>   }
>>
>>
>>
>> Sincerely,
>>
>> DB Tsai
>> -------------------------------------------------------
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>
>>
>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> > I might be stating the obvious for everyone, but the issue here is not
>> > reflection or the source of the JAR, but the ClassLoader. The basic
>> > rules are this.
>> >
>> > "new Foo" will use the ClassLoader that defines Foo. This is usually
>> > the ClassLoader that loaded whatever it is that first referenced Foo
>> > and caused it to be loaded -- usually the ClassLoader holding your
>> > other app classes.
>> >
>> > ClassLoaders can have a parent-child relationship. ClassLoaders always
>> > look in their parent before themselves.
>> >
>> > (Careful then -- in contexts like Hadoop or Tomcat where your app is
>> > loaded in a child ClassLoader, and you reference a class that Hadoop
>> > or Tomcat also has (like a lib class) you will get the container's
>> > version!)
>> >
>> > When you load an external JAR it has a separate ClassLoader which does
>> > not necessarily bear any relation to the one containing your app
>> > classes, so yeah it is not generally going to make "new Foo" work.
>> >
>> > Reflection lets you pick the ClassLoader, yes.
>> >
>> > I would not call setContextClassLoader.
>> >
>> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza <sa...@cloudera.com>
>> > wrote:
>> > > I spoke with DB offline about this a little while ago and he confirmed
>> > that
>> > > he was able to access the jar from the driver.
>> > >
>> > > The issue appears to be a general Java issue: you can't directly
>> > > instantiate a class from a dynamically loaded jar.
>> > >
>> > > I reproduced it locally outside of Spark with:
>> > > ---
>> > >     URLClassLoader urlClassLoader = new URLClassLoader(new URL[] { new
>> > > File("myotherjar.jar").toURI().toURL() }, null);
>> > >     Thread.currentThread().setContextClassLoader(urlClassLoader);
>> > >     MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
>> > > ---
>> > >
>> > > I was able to load the class with reflection.
>> >
>>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by Andrew Ash <an...@andrewash.com>.
Sounds like the problem is that classloaders always look in their parents
before themselves, and Spark users want executors to pick up classes from
their custom code before the ones in Spark plus its dependencies.

Would a custom classloader that delegates to the parent after first
checking itself fix this up?


On Mon, May 19, 2014 at 12:17 AM, DB Tsai <db...@stanford.edu> wrote:

> Hi Sean,
>
> It's true that the issue here is classloader, and due to the classloader
> delegation model, users have to use reflection in the executors to pick up
> the classloader in order to use those classes added by sc.addJars APIs.
> However, it's very inconvenience for users, and not documented in spark.
>
> I'm working on a patch to solve it by calling the protected method addURL
> in URLClassLoader to update the current default classloader, so no
> customClassLoader anymore. I wonder if this is an good way to go.
>
>   private def addURL(url: URL, loader: URLClassLoader){
>     try {
>       val method: Method =
> classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
>       method.setAccessible(true)
>       method.invoke(loader, url)
>     }
>     catch {
>       case t: Throwable => {
>         throw new IOException("Error, could not add URL to system
> classloader")
>       }
>     }
>   }
>
>
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Sun, May 18, 2014 at 11:57 PM, Sean Owen <so...@cloudera.com> wrote:
>
> > I might be stating the obvious for everyone, but the issue here is not
> > reflection or the source of the JAR, but the ClassLoader. The basic
> > rules are this.
> >
> > "new Foo" will use the ClassLoader that defines Foo. This is usually
> > the ClassLoader that loaded whatever it is that first referenced Foo
> > and caused it to be loaded -- usually the ClassLoader holding your
> > other app classes.
> >
> > ClassLoaders can have a parent-child relationship. ClassLoaders always
> > look in their parent before themselves.
> >
> > (Careful then -- in contexts like Hadoop or Tomcat where your app is
> > loaded in a child ClassLoader, and you reference a class that Hadoop
> > or Tomcat also has (like a lib class) you will get the container's
> > version!)
> >
> > When you load an external JAR it has a separate ClassLoader which does
> > not necessarily bear any relation to the one containing your app
> > classes, so yeah it is not generally going to make "new Foo" work.
> >
> > Reflection lets you pick the ClassLoader, yes.
> >
> > I would not call setContextClassLoader.
> >
> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza <sa...@cloudera.com>
> > wrote:
> > > I spoke with DB offline about this a little while ago and he confirmed
> > that
> > > he was able to access the jar from the driver.
> > >
> > > The issue appears to be a general Java issue: you can't directly
> > > instantiate a class from a dynamically loaded jar.
> > >
> > > I reproduced it locally outside of Spark with:
> > > ---
> > >     URLClassLoader urlClassLoader = new URLClassLoader(new URL[] { new
> > > File("myotherjar.jar").toURI().toURL() }, null);
> > >     Thread.currentThread().setContextClassLoader(urlClassLoader);
> > >     MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
> > > ---
> > >
> > > I was able to load the class with reflection.
> >
>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by DB Tsai <db...@stanford.edu>.
Hi Sean,

It's true that the issue here is classloader, and due to the classloader
delegation model, users have to use reflection in the executors to pick up
the classloader in order to use those classes added by sc.addJars APIs.
However, it's very inconvenience for users, and not documented in spark.

I'm working on a patch to solve it by calling the protected method addURL
in URLClassLoader to update the current default classloader, so no
customClassLoader anymore. I wonder if this is an good way to go.

  private def addURL(url: URL, loader: URLClassLoader){
    try {
      val method: Method =
classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
      method.setAccessible(true)
      method.invoke(loader, url)
    }
    catch {
      case t: Throwable => {
        throw new IOException("Error, could not add URL to system
classloader")
      }
    }
  }



Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Sun, May 18, 2014 at 11:57 PM, Sean Owen <so...@cloudera.com> wrote:

> I might be stating the obvious for everyone, but the issue here is not
> reflection or the source of the JAR, but the ClassLoader. The basic
> rules are this.
>
> "new Foo" will use the ClassLoader that defines Foo. This is usually
> the ClassLoader that loaded whatever it is that first referenced Foo
> and caused it to be loaded -- usually the ClassLoader holding your
> other app classes.
>
> ClassLoaders can have a parent-child relationship. ClassLoaders always
> look in their parent before themselves.
>
> (Careful then -- in contexts like Hadoop or Tomcat where your app is
> loaded in a child ClassLoader, and you reference a class that Hadoop
> or Tomcat also has (like a lib class) you will get the container's
> version!)
>
> When you load an external JAR it has a separate ClassLoader which does
> not necessarily bear any relation to the one containing your app
> classes, so yeah it is not generally going to make "new Foo" work.
>
> Reflection lets you pick the ClassLoader, yes.
>
> I would not call setContextClassLoader.
>
> On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza <sa...@cloudera.com>
> wrote:
> > I spoke with DB offline about this a little while ago and he confirmed
> that
> > he was able to access the jar from the driver.
> >
> > The issue appears to be a general Java issue: you can't directly
> > instantiate a class from a dynamically loaded jar.
> >
> > I reproduced it locally outside of Spark with:
> > ---
> >     URLClassLoader urlClassLoader = new URLClassLoader(new URL[] { new
> > File("myotherjar.jar").toURI().toURL() }, null);
> >     Thread.currentThread().setContextClassLoader(urlClassLoader);
> >     MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
> > ---
> >
> > I was able to load the class with reflection.
>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by Sean Owen <so...@cloudera.com>.
I might be stating the obvious for everyone, but the issue here is not
reflection or the source of the JAR, but the ClassLoader. The basic
rules are this.

"new Foo" will use the ClassLoader that defines Foo. This is usually
the ClassLoader that loaded whatever it is that first referenced Foo
and caused it to be loaded -- usually the ClassLoader holding your
other app classes.

ClassLoaders can have a parent-child relationship. ClassLoaders always
look in their parent before themselves.

(Careful then -- in contexts like Hadoop or Tomcat where your app is
loaded in a child ClassLoader, and you reference a class that Hadoop
or Tomcat also has (like a lib class) you will get the container's
version!)

When you load an external JAR it has a separate ClassLoader which does
not necessarily bear any relation to the one containing your app
classes, so yeah it is not generally going to make "new Foo" work.

Reflection lets you pick the ClassLoader, yes.

I would not call setContextClassLoader.

On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza <sa...@cloudera.com> wrote:
> I spoke with DB offline about this a little while ago and he confirmed that
> he was able to access the jar from the driver.
>
> The issue appears to be a general Java issue: you can't directly
> instantiate a class from a dynamically loaded jar.
>
> I reproduced it locally outside of Spark with:
> ---
>     URLClassLoader urlClassLoader = new URLClassLoader(new URL[] { new
> File("myotherjar.jar").toURI().toURL() }, null);
>     Thread.currentThread().setContextClassLoader(urlClassLoader);
>     MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
> ---
>
> I was able to load the class with reflection.

Fwd: Calling external classes added by sc.addJar needs to be through reflection

Posted by DB Tsai <db...@stanford.edu>.
Since the additional jars added by sc.addJars are through http server, even
it works, we still want to have a better way due to scalability (imagine
that thousands of workers downloading jars from driver).

If we ignore the fundamental scalability issue, this can be fixed by using
the customClassloader to create a wrapped class, and in this wrapped class,
the classloader is inherited from the customClassloader so that users don't
need to do reflection in the wrapped class. I'm working on this now.

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


---------- Forwarded message ----------
From: Sandy Ryza <sa...@cloudera.com>
Date: Sun, May 18, 2014 at 4:49 PM
Subject: Re: Calling external classes added by sc.addJar needs to be
through reflection
To: "dev@spark.apache.org" <de...@spark.apache.org>


Hey Xiangrui,

If the jars are placed in the distributed cache and loaded statically, as
the primary app jar is in YARN, then it shouldn't be an issue.  Other jars,
however, including additional jars that are sc.addJar'd and jars specified
with the spark-submit --jars argument, are loaded dynamically by executors
with a URLClassLoader.  These jars aren't next to the executors when they
start - the executors fetch them from the driver's HTTP server.


On Sun, May 18, 2014 at 4:05 PM, Xiangrui Meng <me...@gmail.com> wrote:

> Hi Sandy,
>
> It is hard to imagine that a user needs to create an object in that
> way. Since the jars are already in distributed cache before the
> executor starts, is there any reason we cannot add the locally cached
> jars to classpath directly?
>
> Best,
> Xiangrui
>
> On Sun, May 18, 2014 at 4:00 PM, Sandy Ryza <sa...@cloudera.com>
> wrote:
> > I spoke with DB offline about this a little while ago and he confirmed
> that
> > he was able to access the jar from the driver.
> >
> > The issue appears to be a general Java issue: you can't directly
> > instantiate a class from a dynamically loaded jar.
> >
> > I reproduced it locally outside of Spark with:
> > ---
> >     URLClassLoader urlClassLoader = new URLClassLoader(new URL[] { new
> > File("myotherjar.jar").toURI().toURL() }, null);
> >     Thread.currentThread().setContextClassLoader(urlClassLoader);
> >     MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
> > ---
> >
> > I was able to load the class with reflection.
> >
> >
> >
> > On Sun, May 18, 2014 at 11:58 AM, Patrick Wendell <pwendell@gmail.com
> >wrote:
> >
> >> @db - it's possible that you aren't including the jar in the classpath
> >> of your driver program (I think this is what mridul was suggesting).
> >> It would be helpful to see the stack trace of the CNFE.
> >>
> >> - Patrick
> >>
> >> On Sun, May 18, 2014 at 11:54 AM, Patrick Wendell <pw...@gmail.com>
> >> wrote:
> >> > @xiangrui - we don't expect these to be present on the system
> >> > classpath, because they get dynamically added by Spark (e.g. your
> >> > application can call sc.addJar well after the JVM's have started).
> >> >
> >> > @db - I'm pretty surprised to see that behavior. It's definitely not
> >> > intended that users need reflection to instantiate their classes -
> >> > something odd is going on in your case. If you could create an
> >> > isolated example and post it to the JIRA, that would be great.
> >> >
> >> > On Sun, May 18, 2014 at 9:58 AM, Xiangrui Meng <me...@gmail.com>
> wrote:
> >> >> I created a JIRA: https://issues.apache.org/jira/browse/SPARK-1870
> >> >>
> >> >> DB, could you add more info to that JIRA? Thanks!
> >> >>
> >> >> -Xiangrui
> >> >>
> >> >> On Sun, May 18, 2014 at 9:46 AM, Xiangrui Meng <me...@gmail.com>
> >> wrote:
> >> >>> Btw, I tried
> >> >>>
> >> >>> rdd.map { i =>
> >> >>>   System.getProperty("java.class.path")
> >> >>> }.collect()
> >> >>>
> >> >>> but didn't see the jars added via "--jars" on the executor
> classpath.
> >> >>>
> >> >>> -Xiangrui
> >> >>>
> >> >>> On Sat, May 17, 2014 at 11:26 PM, Xiangrui Meng <me...@gmail.com>
> >> wrote:
> >> >>>> I can re-produce the error with Spark 1.0-RC and YARN (CDH-5). The
> >> >>>> reflection approach mentioned by DB didn't work either. I checked
> the
> >> >>>> distributed cache on a worker node and found the jar there. It is
> also
> >> >>>> in the Environment tab of the WebUI. The workaround is making an
> >> >>>> assembly jar.
> >> >>>>
> >> >>>> DB, could you create a JIRA and describe what you have found so
> far?
> >> Thanks!
> >> >>>>
> >> >>>> Best,
> >> >>>> Xiangrui
> >> >>>>
> >> >>>> On Sat, May 17, 2014 at 1:29 AM, Mridul Muralidharan <
> >> mridul@gmail.com> wrote:
> >> >>>>> Can you try moving your mapPartitions to another class/object
> which
> >> is
> >> >>>>> referenced only after sc.addJar ?
> >> >>>>>
> >> >>>>> I would suspect CNFEx is coming while loading the class
containing
> >> >>>>> mapPartitions before addJars is executed.
> >> >>>>>
> >> >>>>> In general though, dynamic loading of classes means you use
> >> reflection to
> >> >>>>> instantiate it since expectation is you don't know which
> >> implementation
> >> >>>>> provides the interface ... If you statically know it apriori, you
> >> bundle it
> >> >>>>> in your classpath.
> >> >>>>>
> >> >>>>> Regards
> >> >>>>> Mridul
> >> >>>>> On 17-May-2014 7:28 am, "DB Tsai" <db...@stanford.edu> wrote:
> >> >>>>>
> >> >>>>>> Finally find a way out of the ClassLoader maze! It took me some
> >> times to
> >> >>>>>> understand how it works; I think it worths to document it in a
> >> separated
> >> >>>>>> thread.
> >> >>>>>>
> >> >>>>>> We're trying to add external utility.jar which contains
> >> CSVRecordParser,
> >> >>>>>> and we added the jar to executors through sc.addJar APIs.
> >> >>>>>>
> >> >>>>>> If the instance of CSVRecordParser is created without
> reflection, it
> >> >>>>>> raises *ClassNotFound
> >> >>>>>> Exception*.
> >> >>>>>>
> >> >>>>>> data.mapPartitions(lines => {
> >> >>>>>>     val csvParser = new CSVRecordParser((delimiter.charAt(0))
> >> >>>>>>     lines.foreach(line => {
> >> >>>>>>       val lineElems = csvParser.parseLine(line)
> >> >>>>>>     })
> >> >>>>>>     ...
> >> >>>>>>     ...
> >> >>>>>>  )
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> If the instance of CSVRecordParser is created through
reflection,
> >> it works.
> >> >>>>>>
> >> >>>>>> data.mapPartitions(lines => {
> >> >>>>>>     val loader = Thread.currentThread.getContextClassLoader
> >> >>>>>>     val CSVRecordParser =
> >> >>>>>>
loader.loadClass("com.alpine.hadoop.ext.CSVRecordParser")
> >> >>>>>>
> >> >>>>>>     val csvParser =
> CSVRecordParser.getConstructor(Character.TYPE)
> >> >>>>>>
.newInstance(delimiter.charAt(0).asInstanceOf[Character])
> >> >>>>>>
> >> >>>>>>     val parseLine = CSVRecordParser
> >> >>>>>>         .getDeclaredMethod("parseLine", classOf[String])
> >> >>>>>>
> >> >>>>>>     lines.foreach(line => {
> >> >>>>>>        val lineElems = parseLine.invoke(csvParser,
> >> >>>>>> line).asInstanceOf[Array[String]]
> >> >>>>>>     })
> >> >>>>>>     ...
> >> >>>>>>     ...
> >> >>>>>>  )
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> This is identical to this question,
> >> >>>>>>
> >> >>>>>>
> >>
>
http://stackoverflow.com/questions/7452411/thread-currentthread-setcontextclassloader-without-using-reflection
> >> >>>>>>
> >> >>>>>> It's not intuitive for users to load external classes through
> >> reflection,
> >> >>>>>> but couple available solutions including 1) messing around
> >> >>>>>> systemClassLoader by calling systemClassLoader.addURI through
> >> reflection or
> >> >>>>>> 2) forking another JVM to add jars into classpath before
> bootstrap
> >> loader
> >> >>>>>> are very tricky.
> >> >>>>>>
> >> >>>>>> Any thought on fixing it properly?
> >> >>>>>>
> >> >>>>>> @Xiangrui,
> >> >>>>>> netlib-java jniloader is loaded from netlib-java through
> >> reflection, so
> >> >>>>>> this problem will not be seen.
> >> >>>>>>
> >> >>>>>> Sincerely,
> >> >>>>>>
> >> >>>>>> DB Tsai
> >> >>>>>> -------------------------------------------------------
> >> >>>>>> My Blog: https://www.dbtsai.com
> >> >>>>>> LinkedIn: https://www.linkedin.com/in/dbtsai
> >> >>>>>>
> >>
>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by Sandy Ryza <sa...@cloudera.com>.
Hey Xiangrui,

If the jars are placed in the distributed cache and loaded statically, as
the primary app jar is in YARN, then it shouldn't be an issue.  Other jars,
however, including additional jars that are sc.addJar'd and jars specified
with the spark-submit --jars argument, are loaded dynamically by executors
with a URLClassLoader.  These jars aren't next to the executors when they
start - the executors fetch them from the driver's HTTP server.


On Sun, May 18, 2014 at 4:05 PM, Xiangrui Meng <me...@gmail.com> wrote:

> Hi Sandy,
>
> It is hard to imagine that a user needs to create an object in that
> way. Since the jars are already in distributed cache before the
> executor starts, is there any reason we cannot add the locally cached
> jars to classpath directly?
>
> Best,
> Xiangrui
>
> On Sun, May 18, 2014 at 4:00 PM, Sandy Ryza <sa...@cloudera.com>
> wrote:
> > I spoke with DB offline about this a little while ago and he confirmed
> that
> > he was able to access the jar from the driver.
> >
> > The issue appears to be a general Java issue: you can't directly
> > instantiate a class from a dynamically loaded jar.
> >
> > I reproduced it locally outside of Spark with:
> > ---
> >     URLClassLoader urlClassLoader = new URLClassLoader(new URL[] { new
> > File("myotherjar.jar").toURI().toURL() }, null);
> >     Thread.currentThread().setContextClassLoader(urlClassLoader);
> >     MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
> > ---
> >
> > I was able to load the class with reflection.
> >
> >
> >
> > On Sun, May 18, 2014 at 11:58 AM, Patrick Wendell <pwendell@gmail.com
> >wrote:
> >
> >> @db - it's possible that you aren't including the jar in the classpath
> >> of your driver program (I think this is what mridul was suggesting).
> >> It would be helpful to see the stack trace of the CNFE.
> >>
> >> - Patrick
> >>
> >> On Sun, May 18, 2014 at 11:54 AM, Patrick Wendell <pw...@gmail.com>
> >> wrote:
> >> > @xiangrui - we don't expect these to be present on the system
> >> > classpath, because they get dynamically added by Spark (e.g. your
> >> > application can call sc.addJar well after the JVM's have started).
> >> >
> >> > @db - I'm pretty surprised to see that behavior. It's definitely not
> >> > intended that users need reflection to instantiate their classes -
> >> > something odd is going on in your case. If you could create an
> >> > isolated example and post it to the JIRA, that would be great.
> >> >
> >> > On Sun, May 18, 2014 at 9:58 AM, Xiangrui Meng <me...@gmail.com>
> wrote:
> >> >> I created a JIRA: https://issues.apache.org/jira/browse/SPARK-1870
> >> >>
> >> >> DB, could you add more info to that JIRA? Thanks!
> >> >>
> >> >> -Xiangrui
> >> >>
> >> >> On Sun, May 18, 2014 at 9:46 AM, Xiangrui Meng <me...@gmail.com>
> >> wrote:
> >> >>> Btw, I tried
> >> >>>
> >> >>> rdd.map { i =>
> >> >>>   System.getProperty("java.class.path")
> >> >>> }.collect()
> >> >>>
> >> >>> but didn't see the jars added via "--jars" on the executor
> classpath.
> >> >>>
> >> >>> -Xiangrui
> >> >>>
> >> >>> On Sat, May 17, 2014 at 11:26 PM, Xiangrui Meng <me...@gmail.com>
> >> wrote:
> >> >>>> I can re-produce the error with Spark 1.0-RC and YARN (CDH-5). The
> >> >>>> reflection approach mentioned by DB didn't work either. I checked
> the
> >> >>>> distributed cache on a worker node and found the jar there. It is
> also
> >> >>>> in the Environment tab of the WebUI. The workaround is making an
> >> >>>> assembly jar.
> >> >>>>
> >> >>>> DB, could you create a JIRA and describe what you have found so
> far?
> >> Thanks!
> >> >>>>
> >> >>>> Best,
> >> >>>> Xiangrui
> >> >>>>
> >> >>>> On Sat, May 17, 2014 at 1:29 AM, Mridul Muralidharan <
> >> mridul@gmail.com> wrote:
> >> >>>>> Can you try moving your mapPartitions to another class/object
> which
> >> is
> >> >>>>> referenced only after sc.addJar ?
> >> >>>>>
> >> >>>>> I would suspect CNFEx is coming while loading the class containing
> >> >>>>> mapPartitions before addJars is executed.
> >> >>>>>
> >> >>>>> In general though, dynamic loading of classes means you use
> >> reflection to
> >> >>>>> instantiate it since expectation is you don't know which
> >> implementation
> >> >>>>> provides the interface ... If you statically know it apriori, you
> >> bundle it
> >> >>>>> in your classpath.
> >> >>>>>
> >> >>>>> Regards
> >> >>>>> Mridul
> >> >>>>> On 17-May-2014 7:28 am, "DB Tsai" <db...@stanford.edu> wrote:
> >> >>>>>
> >> >>>>>> Finally find a way out of the ClassLoader maze! It took me some
> >> times to
> >> >>>>>> understand how it works; I think it worths to document it in a
> >> separated
> >> >>>>>> thread.
> >> >>>>>>
> >> >>>>>> We're trying to add external utility.jar which contains
> >> CSVRecordParser,
> >> >>>>>> and we added the jar to executors through sc.addJar APIs.
> >> >>>>>>
> >> >>>>>> If the instance of CSVRecordParser is created without
> reflection, it
> >> >>>>>> raises *ClassNotFound
> >> >>>>>> Exception*.
> >> >>>>>>
> >> >>>>>> data.mapPartitions(lines => {
> >> >>>>>>     val csvParser = new CSVRecordParser((delimiter.charAt(0))
> >> >>>>>>     lines.foreach(line => {
> >> >>>>>>       val lineElems = csvParser.parseLine(line)
> >> >>>>>>     })
> >> >>>>>>     ...
> >> >>>>>>     ...
> >> >>>>>>  )
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> If the instance of CSVRecordParser is created through reflection,
> >> it works.
> >> >>>>>>
> >> >>>>>> data.mapPartitions(lines => {
> >> >>>>>>     val loader = Thread.currentThread.getContextClassLoader
> >> >>>>>>     val CSVRecordParser =
> >> >>>>>>         loader.loadClass("com.alpine.hadoop.ext.CSVRecordParser")
> >> >>>>>>
> >> >>>>>>     val csvParser =
> CSVRecordParser.getConstructor(Character.TYPE)
> >> >>>>>>         .newInstance(delimiter.charAt(0).asInstanceOf[Character])
> >> >>>>>>
> >> >>>>>>     val parseLine = CSVRecordParser
> >> >>>>>>         .getDeclaredMethod("parseLine", classOf[String])
> >> >>>>>>
> >> >>>>>>     lines.foreach(line => {
> >> >>>>>>        val lineElems = parseLine.invoke(csvParser,
> >> >>>>>> line).asInstanceOf[Array[String]]
> >> >>>>>>     })
> >> >>>>>>     ...
> >> >>>>>>     ...
> >> >>>>>>  )
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> This is identical to this question,
> >> >>>>>>
> >> >>>>>>
> >>
> http://stackoverflow.com/questions/7452411/thread-currentthread-setcontextclassloader-without-using-reflection
> >> >>>>>>
> >> >>>>>> It's not intuitive for users to load external classes through
> >> reflection,
> >> >>>>>> but couple available solutions including 1) messing around
> >> >>>>>> systemClassLoader by calling systemClassLoader.addURI through
> >> reflection or
> >> >>>>>> 2) forking another JVM to add jars into classpath before
> bootstrap
> >> loader
> >> >>>>>> are very tricky.
> >> >>>>>>
> >> >>>>>> Any thought on fixing it properly?
> >> >>>>>>
> >> >>>>>> @Xiangrui,
> >> >>>>>> netlib-java jniloader is loaded from netlib-java through
> >> reflection, so
> >> >>>>>> this problem will not be seen.
> >> >>>>>>
> >> >>>>>> Sincerely,
> >> >>>>>>
> >> >>>>>> DB Tsai
> >> >>>>>> -------------------------------------------------------
> >> >>>>>> My Blog: https://www.dbtsai.com
> >> >>>>>> LinkedIn: https://www.linkedin.com/in/dbtsai
> >> >>>>>>
> >>
>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by Xiangrui Meng <me...@gmail.com>.
Hi Sandy,

It is hard to imagine that a user needs to create an object in that
way. Since the jars are already in distributed cache before the
executor starts, is there any reason we cannot add the locally cached
jars to classpath directly?

Best,
Xiangrui

On Sun, May 18, 2014 at 4:00 PM, Sandy Ryza <sa...@cloudera.com> wrote:
> I spoke with DB offline about this a little while ago and he confirmed that
> he was able to access the jar from the driver.
>
> The issue appears to be a general Java issue: you can't directly
> instantiate a class from a dynamically loaded jar.
>
> I reproduced it locally outside of Spark with:
> ---
>     URLClassLoader urlClassLoader = new URLClassLoader(new URL[] { new
> File("myotherjar.jar").toURI().toURL() }, null);
>     Thread.currentThread().setContextClassLoader(urlClassLoader);
>     MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
> ---
>
> I was able to load the class with reflection.
>
>
>
> On Sun, May 18, 2014 at 11:58 AM, Patrick Wendell <pw...@gmail.com>wrote:
>
>> @db - it's possible that you aren't including the jar in the classpath
>> of your driver program (I think this is what mridul was suggesting).
>> It would be helpful to see the stack trace of the CNFE.
>>
>> - Patrick
>>
>> On Sun, May 18, 2014 at 11:54 AM, Patrick Wendell <pw...@gmail.com>
>> wrote:
>> > @xiangrui - we don't expect these to be present on the system
>> > classpath, because they get dynamically added by Spark (e.g. your
>> > application can call sc.addJar well after the JVM's have started).
>> >
>> > @db - I'm pretty surprised to see that behavior. It's definitely not
>> > intended that users need reflection to instantiate their classes -
>> > something odd is going on in your case. If you could create an
>> > isolated example and post it to the JIRA, that would be great.
>> >
>> > On Sun, May 18, 2014 at 9:58 AM, Xiangrui Meng <me...@gmail.com> wrote:
>> >> I created a JIRA: https://issues.apache.org/jira/browse/SPARK-1870
>> >>
>> >> DB, could you add more info to that JIRA? Thanks!
>> >>
>> >> -Xiangrui
>> >>
>> >> On Sun, May 18, 2014 at 9:46 AM, Xiangrui Meng <me...@gmail.com>
>> wrote:
>> >>> Btw, I tried
>> >>>
>> >>> rdd.map { i =>
>> >>>   System.getProperty("java.class.path")
>> >>> }.collect()
>> >>>
>> >>> but didn't see the jars added via "--jars" on the executor classpath.
>> >>>
>> >>> -Xiangrui
>> >>>
>> >>> On Sat, May 17, 2014 at 11:26 PM, Xiangrui Meng <me...@gmail.com>
>> wrote:
>> >>>> I can re-produce the error with Spark 1.0-RC and YARN (CDH-5). The
>> >>>> reflection approach mentioned by DB didn't work either. I checked the
>> >>>> distributed cache on a worker node and found the jar there. It is also
>> >>>> in the Environment tab of the WebUI. The workaround is making an
>> >>>> assembly jar.
>> >>>>
>> >>>> DB, could you create a JIRA and describe what you have found so far?
>> Thanks!
>> >>>>
>> >>>> Best,
>> >>>> Xiangrui
>> >>>>
>> >>>> On Sat, May 17, 2014 at 1:29 AM, Mridul Muralidharan <
>> mridul@gmail.com> wrote:
>> >>>>> Can you try moving your mapPartitions to another class/object which
>> is
>> >>>>> referenced only after sc.addJar ?
>> >>>>>
>> >>>>> I would suspect CNFEx is coming while loading the class containing
>> >>>>> mapPartitions before addJars is executed.
>> >>>>>
>> >>>>> In general though, dynamic loading of classes means you use
>> reflection to
>> >>>>> instantiate it since expectation is you don't know which
>> implementation
>> >>>>> provides the interface ... If you statically know it apriori, you
>> bundle it
>> >>>>> in your classpath.
>> >>>>>
>> >>>>> Regards
>> >>>>> Mridul
>> >>>>> On 17-May-2014 7:28 am, "DB Tsai" <db...@stanford.edu> wrote:
>> >>>>>
>> >>>>>> Finally find a way out of the ClassLoader maze! It took me some
>> times to
>> >>>>>> understand how it works; I think it worths to document it in a
>> separated
>> >>>>>> thread.
>> >>>>>>
>> >>>>>> We're trying to add external utility.jar which contains
>> CSVRecordParser,
>> >>>>>> and we added the jar to executors through sc.addJar APIs.
>> >>>>>>
>> >>>>>> If the instance of CSVRecordParser is created without reflection, it
>> >>>>>> raises *ClassNotFound
>> >>>>>> Exception*.
>> >>>>>>
>> >>>>>> data.mapPartitions(lines => {
>> >>>>>>     val csvParser = new CSVRecordParser((delimiter.charAt(0))
>> >>>>>>     lines.foreach(line => {
>> >>>>>>       val lineElems = csvParser.parseLine(line)
>> >>>>>>     })
>> >>>>>>     ...
>> >>>>>>     ...
>> >>>>>>  )
>> >>>>>>
>> >>>>>>
>> >>>>>> If the instance of CSVRecordParser is created through reflection,
>> it works.
>> >>>>>>
>> >>>>>> data.mapPartitions(lines => {
>> >>>>>>     val loader = Thread.currentThread.getContextClassLoader
>> >>>>>>     val CSVRecordParser =
>> >>>>>>         loader.loadClass("com.alpine.hadoop.ext.CSVRecordParser")
>> >>>>>>
>> >>>>>>     val csvParser = CSVRecordParser.getConstructor(Character.TYPE)
>> >>>>>>         .newInstance(delimiter.charAt(0).asInstanceOf[Character])
>> >>>>>>
>> >>>>>>     val parseLine = CSVRecordParser
>> >>>>>>         .getDeclaredMethod("parseLine", classOf[String])
>> >>>>>>
>> >>>>>>     lines.foreach(line => {
>> >>>>>>        val lineElems = parseLine.invoke(csvParser,
>> >>>>>> line).asInstanceOf[Array[String]]
>> >>>>>>     })
>> >>>>>>     ...
>> >>>>>>     ...
>> >>>>>>  )
>> >>>>>>
>> >>>>>>
>> >>>>>> This is identical to this question,
>> >>>>>>
>> >>>>>>
>> http://stackoverflow.com/questions/7452411/thread-currentthread-setcontextclassloader-without-using-reflection
>> >>>>>>
>> >>>>>> It's not intuitive for users to load external classes through
>> reflection,
>> >>>>>> but couple available solutions including 1) messing around
>> >>>>>> systemClassLoader by calling systemClassLoader.addURI through
>> reflection or
>> >>>>>> 2) forking another JVM to add jars into classpath before bootstrap
>> loader
>> >>>>>> are very tricky.
>> >>>>>>
>> >>>>>> Any thought on fixing it properly?
>> >>>>>>
>> >>>>>> @Xiangrui,
>> >>>>>> netlib-java jniloader is loaded from netlib-java through
>> reflection, so
>> >>>>>> this problem will not be seen.
>> >>>>>>
>> >>>>>> Sincerely,
>> >>>>>>
>> >>>>>> DB Tsai
>> >>>>>> -------------------------------------------------------
>> >>>>>> My Blog: https://www.dbtsai.com
>> >>>>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>> >>>>>>
>>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by Sandy Ryza <sa...@cloudera.com>.
I spoke with DB offline about this a little while ago and he confirmed that
he was able to access the jar from the driver.

The issue appears to be a general Java issue: you can't directly
instantiate a class from a dynamically loaded jar.

I reproduced it locally outside of Spark with:
---
    URLClassLoader urlClassLoader = new URLClassLoader(new URL[] { new
File("myotherjar.jar").toURI().toURL() }, null);
    Thread.currentThread().setContextClassLoader(urlClassLoader);
    MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
---

I was able to load the class with reflection.



On Sun, May 18, 2014 at 11:58 AM, Patrick Wendell <pw...@gmail.com>wrote:

> @db - it's possible that you aren't including the jar in the classpath
> of your driver program (I think this is what mridul was suggesting).
> It would be helpful to see the stack trace of the CNFE.
>
> - Patrick
>
> On Sun, May 18, 2014 at 11:54 AM, Patrick Wendell <pw...@gmail.com>
> wrote:
> > @xiangrui - we don't expect these to be present on the system
> > classpath, because they get dynamically added by Spark (e.g. your
> > application can call sc.addJar well after the JVM's have started).
> >
> > @db - I'm pretty surprised to see that behavior. It's definitely not
> > intended that users need reflection to instantiate their classes -
> > something odd is going on in your case. If you could create an
> > isolated example and post it to the JIRA, that would be great.
> >
> > On Sun, May 18, 2014 at 9:58 AM, Xiangrui Meng <me...@gmail.com> wrote:
> >> I created a JIRA: https://issues.apache.org/jira/browse/SPARK-1870
> >>
> >> DB, could you add more info to that JIRA? Thanks!
> >>
> >> -Xiangrui
> >>
> >> On Sun, May 18, 2014 at 9:46 AM, Xiangrui Meng <me...@gmail.com>
> wrote:
> >>> Btw, I tried
> >>>
> >>> rdd.map { i =>
> >>>   System.getProperty("java.class.path")
> >>> }.collect()
> >>>
> >>> but didn't see the jars added via "--jars" on the executor classpath.
> >>>
> >>> -Xiangrui
> >>>
> >>> On Sat, May 17, 2014 at 11:26 PM, Xiangrui Meng <me...@gmail.com>
> wrote:
> >>>> I can re-produce the error with Spark 1.0-RC and YARN (CDH-5). The
> >>>> reflection approach mentioned by DB didn't work either. I checked the
> >>>> distributed cache on a worker node and found the jar there. It is also
> >>>> in the Environment tab of the WebUI. The workaround is making an
> >>>> assembly jar.
> >>>>
> >>>> DB, could you create a JIRA and describe what you have found so far?
> Thanks!
> >>>>
> >>>> Best,
> >>>> Xiangrui
> >>>>
> >>>> On Sat, May 17, 2014 at 1:29 AM, Mridul Muralidharan <
> mridul@gmail.com> wrote:
> >>>>> Can you try moving your mapPartitions to another class/object which
> is
> >>>>> referenced only after sc.addJar ?
> >>>>>
> >>>>> I would suspect CNFEx is coming while loading the class containing
> >>>>> mapPartitions before addJars is executed.
> >>>>>
> >>>>> In general though, dynamic loading of classes means you use
> reflection to
> >>>>> instantiate it since expectation is you don't know which
> implementation
> >>>>> provides the interface ... If you statically know it apriori, you
> bundle it
> >>>>> in your classpath.
> >>>>>
> >>>>> Regards
> >>>>> Mridul
> >>>>> On 17-May-2014 7:28 am, "DB Tsai" <db...@stanford.edu> wrote:
> >>>>>
> >>>>>> Finally find a way out of the ClassLoader maze! It took me some
> times to
> >>>>>> understand how it works; I think it worths to document it in a
> separated
> >>>>>> thread.
> >>>>>>
> >>>>>> We're trying to add external utility.jar which contains
> CSVRecordParser,
> >>>>>> and we added the jar to executors through sc.addJar APIs.
> >>>>>>
> >>>>>> If the instance of CSVRecordParser is created without reflection, it
> >>>>>> raises *ClassNotFound
> >>>>>> Exception*.
> >>>>>>
> >>>>>> data.mapPartitions(lines => {
> >>>>>>     val csvParser = new CSVRecordParser((delimiter.charAt(0))
> >>>>>>     lines.foreach(line => {
> >>>>>>       val lineElems = csvParser.parseLine(line)
> >>>>>>     })
> >>>>>>     ...
> >>>>>>     ...
> >>>>>>  )
> >>>>>>
> >>>>>>
> >>>>>> If the instance of CSVRecordParser is created through reflection,
> it works.
> >>>>>>
> >>>>>> data.mapPartitions(lines => {
> >>>>>>     val loader = Thread.currentThread.getContextClassLoader
> >>>>>>     val CSVRecordParser =
> >>>>>>         loader.loadClass("com.alpine.hadoop.ext.CSVRecordParser")
> >>>>>>
> >>>>>>     val csvParser = CSVRecordParser.getConstructor(Character.TYPE)
> >>>>>>         .newInstance(delimiter.charAt(0).asInstanceOf[Character])
> >>>>>>
> >>>>>>     val parseLine = CSVRecordParser
> >>>>>>         .getDeclaredMethod("parseLine", classOf[String])
> >>>>>>
> >>>>>>     lines.foreach(line => {
> >>>>>>        val lineElems = parseLine.invoke(csvParser,
> >>>>>> line).asInstanceOf[Array[String]]
> >>>>>>     })
> >>>>>>     ...
> >>>>>>     ...
> >>>>>>  )
> >>>>>>
> >>>>>>
> >>>>>> This is identical to this question,
> >>>>>>
> >>>>>>
> http://stackoverflow.com/questions/7452411/thread-currentthread-setcontextclassloader-without-using-reflection
> >>>>>>
> >>>>>> It's not intuitive for users to load external classes through
> reflection,
> >>>>>> but couple available solutions including 1) messing around
> >>>>>> systemClassLoader by calling systemClassLoader.addURI through
> reflection or
> >>>>>> 2) forking another JVM to add jars into classpath before bootstrap
> loader
> >>>>>> are very tricky.
> >>>>>>
> >>>>>> Any thought on fixing it properly?
> >>>>>>
> >>>>>> @Xiangrui,
> >>>>>> netlib-java jniloader is loaded from netlib-java through
> reflection, so
> >>>>>> this problem will not be seen.
> >>>>>>
> >>>>>> Sincerely,
> >>>>>>
> >>>>>> DB Tsai
> >>>>>> -------------------------------------------------------
> >>>>>> My Blog: https://www.dbtsai.com
> >>>>>> LinkedIn: https://www.linkedin.com/in/dbtsai
> >>>>>>
>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by Patrick Wendell <pw...@gmail.com>.
@db - it's possible that you aren't including the jar in the classpath
of your driver program (I think this is what mridul was suggesting).
It would be helpful to see the stack trace of the CNFE.

- Patrick

On Sun, May 18, 2014 at 11:54 AM, Patrick Wendell <pw...@gmail.com> wrote:
> @xiangrui - we don't expect these to be present on the system
> classpath, because they get dynamically added by Spark (e.g. your
> application can call sc.addJar well after the JVM's have started).
>
> @db - I'm pretty surprised to see that behavior. It's definitely not
> intended that users need reflection to instantiate their classes -
> something odd is going on in your case. If you could create an
> isolated example and post it to the JIRA, that would be great.
>
> On Sun, May 18, 2014 at 9:58 AM, Xiangrui Meng <me...@gmail.com> wrote:
>> I created a JIRA: https://issues.apache.org/jira/browse/SPARK-1870
>>
>> DB, could you add more info to that JIRA? Thanks!
>>
>> -Xiangrui
>>
>> On Sun, May 18, 2014 at 9:46 AM, Xiangrui Meng <me...@gmail.com> wrote:
>>> Btw, I tried
>>>
>>> rdd.map { i =>
>>>   System.getProperty("java.class.path")
>>> }.collect()
>>>
>>> but didn't see the jars added via "--jars" on the executor classpath.
>>>
>>> -Xiangrui
>>>
>>> On Sat, May 17, 2014 at 11:26 PM, Xiangrui Meng <me...@gmail.com> wrote:
>>>> I can re-produce the error with Spark 1.0-RC and YARN (CDH-5). The
>>>> reflection approach mentioned by DB didn't work either. I checked the
>>>> distributed cache on a worker node and found the jar there. It is also
>>>> in the Environment tab of the WebUI. The workaround is making an
>>>> assembly jar.
>>>>
>>>> DB, could you create a JIRA and describe what you have found so far? Thanks!
>>>>
>>>> Best,
>>>> Xiangrui
>>>>
>>>> On Sat, May 17, 2014 at 1:29 AM, Mridul Muralidharan <mr...@gmail.com> wrote:
>>>>> Can you try moving your mapPartitions to another class/object which is
>>>>> referenced only after sc.addJar ?
>>>>>
>>>>> I would suspect CNFEx is coming while loading the class containing
>>>>> mapPartitions before addJars is executed.
>>>>>
>>>>> In general though, dynamic loading of classes means you use reflection to
>>>>> instantiate it since expectation is you don't know which implementation
>>>>> provides the interface ... If you statically know it apriori, you bundle it
>>>>> in your classpath.
>>>>>
>>>>> Regards
>>>>> Mridul
>>>>> On 17-May-2014 7:28 am, "DB Tsai" <db...@stanford.edu> wrote:
>>>>>
>>>>>> Finally find a way out of the ClassLoader maze! It took me some times to
>>>>>> understand how it works; I think it worths to document it in a separated
>>>>>> thread.
>>>>>>
>>>>>> We're trying to add external utility.jar which contains CSVRecordParser,
>>>>>> and we added the jar to executors through sc.addJar APIs.
>>>>>>
>>>>>> If the instance of CSVRecordParser is created without reflection, it
>>>>>> raises *ClassNotFound
>>>>>> Exception*.
>>>>>>
>>>>>> data.mapPartitions(lines => {
>>>>>>     val csvParser = new CSVRecordParser((delimiter.charAt(0))
>>>>>>     lines.foreach(line => {
>>>>>>       val lineElems = csvParser.parseLine(line)
>>>>>>     })
>>>>>>     ...
>>>>>>     ...
>>>>>>  )
>>>>>>
>>>>>>
>>>>>> If the instance of CSVRecordParser is created through reflection, it works.
>>>>>>
>>>>>> data.mapPartitions(lines => {
>>>>>>     val loader = Thread.currentThread.getContextClassLoader
>>>>>>     val CSVRecordParser =
>>>>>>         loader.loadClass("com.alpine.hadoop.ext.CSVRecordParser")
>>>>>>
>>>>>>     val csvParser = CSVRecordParser.getConstructor(Character.TYPE)
>>>>>>         .newInstance(delimiter.charAt(0).asInstanceOf[Character])
>>>>>>
>>>>>>     val parseLine = CSVRecordParser
>>>>>>         .getDeclaredMethod("parseLine", classOf[String])
>>>>>>
>>>>>>     lines.foreach(line => {
>>>>>>        val lineElems = parseLine.invoke(csvParser,
>>>>>> line).asInstanceOf[Array[String]]
>>>>>>     })
>>>>>>     ...
>>>>>>     ...
>>>>>>  )
>>>>>>
>>>>>>
>>>>>> This is identical to this question,
>>>>>>
>>>>>> http://stackoverflow.com/questions/7452411/thread-currentthread-setcontextclassloader-without-using-reflection
>>>>>>
>>>>>> It's not intuitive for users to load external classes through reflection,
>>>>>> but couple available solutions including 1) messing around
>>>>>> systemClassLoader by calling systemClassLoader.addURI through reflection or
>>>>>> 2) forking another JVM to add jars into classpath before bootstrap loader
>>>>>> are very tricky.
>>>>>>
>>>>>> Any thought on fixing it properly?
>>>>>>
>>>>>> @Xiangrui,
>>>>>> netlib-java jniloader is loaded from netlib-java through reflection, so
>>>>>> this problem will not be seen.
>>>>>>
>>>>>> Sincerely,
>>>>>>
>>>>>> DB Tsai
>>>>>> -------------------------------------------------------
>>>>>> My Blog: https://www.dbtsai.com
>>>>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>>>>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by Patrick Wendell <pw...@gmail.com>.
@xiangrui - we don't expect these to be present on the system
classpath, because they get dynamically added by Spark (e.g. your
application can call sc.addJar well after the JVM's have started).

@db - I'm pretty surprised to see that behavior. It's definitely not
intended that users need reflection to instantiate their classes -
something odd is going on in your case. If you could create an
isolated example and post it to the JIRA, that would be great.

On Sun, May 18, 2014 at 9:58 AM, Xiangrui Meng <me...@gmail.com> wrote:
> I created a JIRA: https://issues.apache.org/jira/browse/SPARK-1870
>
> DB, could you add more info to that JIRA? Thanks!
>
> -Xiangrui
>
> On Sun, May 18, 2014 at 9:46 AM, Xiangrui Meng <me...@gmail.com> wrote:
>> Btw, I tried
>>
>> rdd.map { i =>
>>   System.getProperty("java.class.path")
>> }.collect()
>>
>> but didn't see the jars added via "--jars" on the executor classpath.
>>
>> -Xiangrui
>>
>> On Sat, May 17, 2014 at 11:26 PM, Xiangrui Meng <me...@gmail.com> wrote:
>>> I can re-produce the error with Spark 1.0-RC and YARN (CDH-5). The
>>> reflection approach mentioned by DB didn't work either. I checked the
>>> distributed cache on a worker node and found the jar there. It is also
>>> in the Environment tab of the WebUI. The workaround is making an
>>> assembly jar.
>>>
>>> DB, could you create a JIRA and describe what you have found so far? Thanks!
>>>
>>> Best,
>>> Xiangrui
>>>
>>> On Sat, May 17, 2014 at 1:29 AM, Mridul Muralidharan <mr...@gmail.com> wrote:
>>>> Can you try moving your mapPartitions to another class/object which is
>>>> referenced only after sc.addJar ?
>>>>
>>>> I would suspect CNFEx is coming while loading the class containing
>>>> mapPartitions before addJars is executed.
>>>>
>>>> In general though, dynamic loading of classes means you use reflection to
>>>> instantiate it since expectation is you don't know which implementation
>>>> provides the interface ... If you statically know it apriori, you bundle it
>>>> in your classpath.
>>>>
>>>> Regards
>>>> Mridul
>>>> On 17-May-2014 7:28 am, "DB Tsai" <db...@stanford.edu> wrote:
>>>>
>>>>> Finally find a way out of the ClassLoader maze! It took me some times to
>>>>> understand how it works; I think it worths to document it in a separated
>>>>> thread.
>>>>>
>>>>> We're trying to add external utility.jar which contains CSVRecordParser,
>>>>> and we added the jar to executors through sc.addJar APIs.
>>>>>
>>>>> If the instance of CSVRecordParser is created without reflection, it
>>>>> raises *ClassNotFound
>>>>> Exception*.
>>>>>
>>>>> data.mapPartitions(lines => {
>>>>>     val csvParser = new CSVRecordParser((delimiter.charAt(0))
>>>>>     lines.foreach(line => {
>>>>>       val lineElems = csvParser.parseLine(line)
>>>>>     })
>>>>>     ...
>>>>>     ...
>>>>>  )
>>>>>
>>>>>
>>>>> If the instance of CSVRecordParser is created through reflection, it works.
>>>>>
>>>>> data.mapPartitions(lines => {
>>>>>     val loader = Thread.currentThread.getContextClassLoader
>>>>>     val CSVRecordParser =
>>>>>         loader.loadClass("com.alpine.hadoop.ext.CSVRecordParser")
>>>>>
>>>>>     val csvParser = CSVRecordParser.getConstructor(Character.TYPE)
>>>>>         .newInstance(delimiter.charAt(0).asInstanceOf[Character])
>>>>>
>>>>>     val parseLine = CSVRecordParser
>>>>>         .getDeclaredMethod("parseLine", classOf[String])
>>>>>
>>>>>     lines.foreach(line => {
>>>>>        val lineElems = parseLine.invoke(csvParser,
>>>>> line).asInstanceOf[Array[String]]
>>>>>     })
>>>>>     ...
>>>>>     ...
>>>>>  )
>>>>>
>>>>>
>>>>> This is identical to this question,
>>>>>
>>>>> http://stackoverflow.com/questions/7452411/thread-currentthread-setcontextclassloader-without-using-reflection
>>>>>
>>>>> It's not intuitive for users to load external classes through reflection,
>>>>> but couple available solutions including 1) messing around
>>>>> systemClassLoader by calling systemClassLoader.addURI through reflection or
>>>>> 2) forking another JVM to add jars into classpath before bootstrap loader
>>>>> are very tricky.
>>>>>
>>>>> Any thought on fixing it properly?
>>>>>
>>>>> @Xiangrui,
>>>>> netlib-java jniloader is loaded from netlib-java through reflection, so
>>>>> this problem will not be seen.
>>>>>
>>>>> Sincerely,
>>>>>
>>>>> DB Tsai
>>>>> -------------------------------------------------------
>>>>> My Blog: https://www.dbtsai.com
>>>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>>>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by Xiangrui Meng <me...@gmail.com>.
I created a JIRA: https://issues.apache.org/jira/browse/SPARK-1870

DB, could you add more info to that JIRA? Thanks!

-Xiangrui

On Sun, May 18, 2014 at 9:46 AM, Xiangrui Meng <me...@gmail.com> wrote:
> Btw, I tried
>
> rdd.map { i =>
>   System.getProperty("java.class.path")
> }.collect()
>
> but didn't see the jars added via "--jars" on the executor classpath.
>
> -Xiangrui
>
> On Sat, May 17, 2014 at 11:26 PM, Xiangrui Meng <me...@gmail.com> wrote:
>> I can re-produce the error with Spark 1.0-RC and YARN (CDH-5). The
>> reflection approach mentioned by DB didn't work either. I checked the
>> distributed cache on a worker node and found the jar there. It is also
>> in the Environment tab of the WebUI. The workaround is making an
>> assembly jar.
>>
>> DB, could you create a JIRA and describe what you have found so far? Thanks!
>>
>> Best,
>> Xiangrui
>>
>> On Sat, May 17, 2014 at 1:29 AM, Mridul Muralidharan <mr...@gmail.com> wrote:
>>> Can you try moving your mapPartitions to another class/object which is
>>> referenced only after sc.addJar ?
>>>
>>> I would suspect CNFEx is coming while loading the class containing
>>> mapPartitions before addJars is executed.
>>>
>>> In general though, dynamic loading of classes means you use reflection to
>>> instantiate it since expectation is you don't know which implementation
>>> provides the interface ... If you statically know it apriori, you bundle it
>>> in your classpath.
>>>
>>> Regards
>>> Mridul
>>> On 17-May-2014 7:28 am, "DB Tsai" <db...@stanford.edu> wrote:
>>>
>>>> Finally find a way out of the ClassLoader maze! It took me some times to
>>>> understand how it works; I think it worths to document it in a separated
>>>> thread.
>>>>
>>>> We're trying to add external utility.jar which contains CSVRecordParser,
>>>> and we added the jar to executors through sc.addJar APIs.
>>>>
>>>> If the instance of CSVRecordParser is created without reflection, it
>>>> raises *ClassNotFound
>>>> Exception*.
>>>>
>>>> data.mapPartitions(lines => {
>>>>     val csvParser = new CSVRecordParser((delimiter.charAt(0))
>>>>     lines.foreach(line => {
>>>>       val lineElems = csvParser.parseLine(line)
>>>>     })
>>>>     ...
>>>>     ...
>>>>  )
>>>>
>>>>
>>>> If the instance of CSVRecordParser is created through reflection, it works.
>>>>
>>>> data.mapPartitions(lines => {
>>>>     val loader = Thread.currentThread.getContextClassLoader
>>>>     val CSVRecordParser =
>>>>         loader.loadClass("com.alpine.hadoop.ext.CSVRecordParser")
>>>>
>>>>     val csvParser = CSVRecordParser.getConstructor(Character.TYPE)
>>>>         .newInstance(delimiter.charAt(0).asInstanceOf[Character])
>>>>
>>>>     val parseLine = CSVRecordParser
>>>>         .getDeclaredMethod("parseLine", classOf[String])
>>>>
>>>>     lines.foreach(line => {
>>>>        val lineElems = parseLine.invoke(csvParser,
>>>> line).asInstanceOf[Array[String]]
>>>>     })
>>>>     ...
>>>>     ...
>>>>  )
>>>>
>>>>
>>>> This is identical to this question,
>>>>
>>>> http://stackoverflow.com/questions/7452411/thread-currentthread-setcontextclassloader-without-using-reflection
>>>>
>>>> It's not intuitive for users to load external classes through reflection,
>>>> but couple available solutions including 1) messing around
>>>> systemClassLoader by calling systemClassLoader.addURI through reflection or
>>>> 2) forking another JVM to add jars into classpath before bootstrap loader
>>>> are very tricky.
>>>>
>>>> Any thought on fixing it properly?
>>>>
>>>> @Xiangrui,
>>>> netlib-java jniloader is loaded from netlib-java through reflection, so
>>>> this problem will not be seen.
>>>>
>>>> Sincerely,
>>>>
>>>> DB Tsai
>>>> -------------------------------------------------------
>>>> My Blog: https://www.dbtsai.com
>>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by DB Tsai <db...@stanford.edu>.
The reflection actually works. But you need to get the loader by `val
loader = Thread.currentThread.getContextClassLoader` which is set by Spark
executor. Our team verified this, and uses it as workaround.



Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Sun, May 18, 2014 at 9:46 AM, Xiangrui Meng <me...@gmail.com> wrote:

> Btw, I tried
>
> rdd.map { i =>
>   System.getProperty("java.class.path")
> }.collect()
>
> but didn't see the jars added via "--jars" on the executor classpath.
>
> -Xiangrui
>
> On Sat, May 17, 2014 at 11:26 PM, Xiangrui Meng <me...@gmail.com> wrote:
> > I can re-produce the error with Spark 1.0-RC and YARN (CDH-5). The
> > reflection approach mentioned by DB didn't work either. I checked the
> > distributed cache on a worker node and found the jar there. It is also
> > in the Environment tab of the WebUI. The workaround is making an
> > assembly jar.
> >
> > DB, could you create a JIRA and describe what you have found so far?
> Thanks!
> >
> > Best,
> > Xiangrui
> >
> > On Sat, May 17, 2014 at 1:29 AM, Mridul Muralidharan <mr...@gmail.com>
> wrote:
> >> Can you try moving your mapPartitions to another class/object which is
> >> referenced only after sc.addJar ?
> >>
> >> I would suspect CNFEx is coming while loading the class containing
> >> mapPartitions before addJars is executed.
> >>
> >> In general though, dynamic loading of classes means you use reflection
> to
> >> instantiate it since expectation is you don't know which implementation
> >> provides the interface ... If you statically know it apriori, you
> bundle it
> >> in your classpath.
> >>
> >> Regards
> >> Mridul
> >> On 17-May-2014 7:28 am, "DB Tsai" <db...@stanford.edu> wrote:
> >>
> >>> Finally find a way out of the ClassLoader maze! It took me some times
> to
> >>> understand how it works; I think it worths to document it in a
> separated
> >>> thread.
> >>>
> >>> We're trying to add external utility.jar which contains
> CSVRecordParser,
> >>> and we added the jar to executors through sc.addJar APIs.
> >>>
> >>> If the instance of CSVRecordParser is created without reflection, it
> >>> raises *ClassNotFound
> >>> Exception*.
> >>>
> >>> data.mapPartitions(lines => {
> >>>     val csvParser = new CSVRecordParser((delimiter.charAt(0))
> >>>     lines.foreach(line => {
> >>>       val lineElems = csvParser.parseLine(line)
> >>>     })
> >>>     ...
> >>>     ...
> >>>  )
> >>>
> >>>
> >>> If the instance of CSVRecordParser is created through reflection, it
> works.
> >>>
> >>> data.mapPartitions(lines => {
> >>>     val loader = Thread.currentThread.getContextClassLoader
> >>>     val CSVRecordParser =
> >>>         loader.loadClass("com.alpine.hadoop.ext.CSVRecordParser")
> >>>
> >>>     val csvParser = CSVRecordParser.getConstructor(Character.TYPE)
> >>>         .newInstance(delimiter.charAt(0).asInstanceOf[Character])
> >>>
> >>>     val parseLine = CSVRecordParser
> >>>         .getDeclaredMethod("parseLine", classOf[String])
> >>>
> >>>     lines.foreach(line => {
> >>>        val lineElems = parseLine.invoke(csvParser,
> >>> line).asInstanceOf[Array[String]]
> >>>     })
> >>>     ...
> >>>     ...
> >>>  )
> >>>
> >>>
> >>> This is identical to this question,
> >>>
> >>>
> http://stackoverflow.com/questions/7452411/thread-currentthread-setcontextclassloader-without-using-reflection
> >>>
> >>> It's not intuitive for users to load external classes through
> reflection,
> >>> but couple available solutions including 1) messing around
> >>> systemClassLoader by calling systemClassLoader.addURI through
> reflection or
> >>> 2) forking another JVM to add jars into classpath before bootstrap
> loader
> >>> are very tricky.
> >>>
> >>> Any thought on fixing it properly?
> >>>
> >>> @Xiangrui,
> >>> netlib-java jniloader is loaded from netlib-java through reflection, so
> >>> this problem will not be seen.
> >>>
> >>> Sincerely,
> >>>
> >>> DB Tsai
> >>> -------------------------------------------------------
> >>> My Blog: https://www.dbtsai.com
> >>> LinkedIn: https://www.linkedin.com/in/dbtsai
> >>>
>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by Xiangrui Meng <me...@gmail.com>.
Btw, I tried

rdd.map { i =>
  System.getProperty("java.class.path")
}.collect()

but didn't see the jars added via "--jars" on the executor classpath.

-Xiangrui

On Sat, May 17, 2014 at 11:26 PM, Xiangrui Meng <me...@gmail.com> wrote:
> I can re-produce the error with Spark 1.0-RC and YARN (CDH-5). The
> reflection approach mentioned by DB didn't work either. I checked the
> distributed cache on a worker node and found the jar there. It is also
> in the Environment tab of the WebUI. The workaround is making an
> assembly jar.
>
> DB, could you create a JIRA and describe what you have found so far? Thanks!
>
> Best,
> Xiangrui
>
> On Sat, May 17, 2014 at 1:29 AM, Mridul Muralidharan <mr...@gmail.com> wrote:
>> Can you try moving your mapPartitions to another class/object which is
>> referenced only after sc.addJar ?
>>
>> I would suspect CNFEx is coming while loading the class containing
>> mapPartitions before addJars is executed.
>>
>> In general though, dynamic loading of classes means you use reflection to
>> instantiate it since expectation is you don't know which implementation
>> provides the interface ... If you statically know it apriori, you bundle it
>> in your classpath.
>>
>> Regards
>> Mridul
>> On 17-May-2014 7:28 am, "DB Tsai" <db...@stanford.edu> wrote:
>>
>>> Finally find a way out of the ClassLoader maze! It took me some times to
>>> understand how it works; I think it worths to document it in a separated
>>> thread.
>>>
>>> We're trying to add external utility.jar which contains CSVRecordParser,
>>> and we added the jar to executors through sc.addJar APIs.
>>>
>>> If the instance of CSVRecordParser is created without reflection, it
>>> raises *ClassNotFound
>>> Exception*.
>>>
>>> data.mapPartitions(lines => {
>>>     val csvParser = new CSVRecordParser((delimiter.charAt(0))
>>>     lines.foreach(line => {
>>>       val lineElems = csvParser.parseLine(line)
>>>     })
>>>     ...
>>>     ...
>>>  )
>>>
>>>
>>> If the instance of CSVRecordParser is created through reflection, it works.
>>>
>>> data.mapPartitions(lines => {
>>>     val loader = Thread.currentThread.getContextClassLoader
>>>     val CSVRecordParser =
>>>         loader.loadClass("com.alpine.hadoop.ext.CSVRecordParser")
>>>
>>>     val csvParser = CSVRecordParser.getConstructor(Character.TYPE)
>>>         .newInstance(delimiter.charAt(0).asInstanceOf[Character])
>>>
>>>     val parseLine = CSVRecordParser
>>>         .getDeclaredMethod("parseLine", classOf[String])
>>>
>>>     lines.foreach(line => {
>>>        val lineElems = parseLine.invoke(csvParser,
>>> line).asInstanceOf[Array[String]]
>>>     })
>>>     ...
>>>     ...
>>>  )
>>>
>>>
>>> This is identical to this question,
>>>
>>> http://stackoverflow.com/questions/7452411/thread-currentthread-setcontextclassloader-without-using-reflection
>>>
>>> It's not intuitive for users to load external classes through reflection,
>>> but couple available solutions including 1) messing around
>>> systemClassLoader by calling systemClassLoader.addURI through reflection or
>>> 2) forking another JVM to add jars into classpath before bootstrap loader
>>> are very tricky.
>>>
>>> Any thought on fixing it properly?
>>>
>>> @Xiangrui,
>>> netlib-java jniloader is loaded from netlib-java through reflection, so
>>> this problem will not be seen.
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> -------------------------------------------------------
>>> My Blog: https://www.dbtsai.com
>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by Xiangrui Meng <me...@gmail.com>.
I can re-produce the error with Spark 1.0-RC and YARN (CDH-5). The
reflection approach mentioned by DB didn't work either. I checked the
distributed cache on a worker node and found the jar there. It is also
in the Environment tab of the WebUI. The workaround is making an
assembly jar.

DB, could you create a JIRA and describe what you have found so far? Thanks!

Best,
Xiangrui

On Sat, May 17, 2014 at 1:29 AM, Mridul Muralidharan <mr...@gmail.com> wrote:
> Can you try moving your mapPartitions to another class/object which is
> referenced only after sc.addJar ?
>
> I would suspect CNFEx is coming while loading the class containing
> mapPartitions before addJars is executed.
>
> In general though, dynamic loading of classes means you use reflection to
> instantiate it since expectation is you don't know which implementation
> provides the interface ... If you statically know it apriori, you bundle it
> in your classpath.
>
> Regards
> Mridul
> On 17-May-2014 7:28 am, "DB Tsai" <db...@stanford.edu> wrote:
>
>> Finally find a way out of the ClassLoader maze! It took me some times to
>> understand how it works; I think it worths to document it in a separated
>> thread.
>>
>> We're trying to add external utility.jar which contains CSVRecordParser,
>> and we added the jar to executors through sc.addJar APIs.
>>
>> If the instance of CSVRecordParser is created without reflection, it
>> raises *ClassNotFound
>> Exception*.
>>
>> data.mapPartitions(lines => {
>>     val csvParser = new CSVRecordParser((delimiter.charAt(0))
>>     lines.foreach(line => {
>>       val lineElems = csvParser.parseLine(line)
>>     })
>>     ...
>>     ...
>>  )
>>
>>
>> If the instance of CSVRecordParser is created through reflection, it works.
>>
>> data.mapPartitions(lines => {
>>     val loader = Thread.currentThread.getContextClassLoader
>>     val CSVRecordParser =
>>         loader.loadClass("com.alpine.hadoop.ext.CSVRecordParser")
>>
>>     val csvParser = CSVRecordParser.getConstructor(Character.TYPE)
>>         .newInstance(delimiter.charAt(0).asInstanceOf[Character])
>>
>>     val parseLine = CSVRecordParser
>>         .getDeclaredMethod("parseLine", classOf[String])
>>
>>     lines.foreach(line => {
>>        val lineElems = parseLine.invoke(csvParser,
>> line).asInstanceOf[Array[String]]
>>     })
>>     ...
>>     ...
>>  )
>>
>>
>> This is identical to this question,
>>
>> http://stackoverflow.com/questions/7452411/thread-currentthread-setcontextclassloader-without-using-reflection
>>
>> It's not intuitive for users to load external classes through reflection,
>> but couple available solutions including 1) messing around
>> systemClassLoader by calling systemClassLoader.addURI through reflection or
>> 2) forking another JVM to add jars into classpath before bootstrap loader
>> are very tricky.
>>
>> Any thought on fixing it properly?
>>
>> @Xiangrui,
>> netlib-java jniloader is loaded from netlib-java through reflection, so
>> this problem will not be seen.
>>
>> Sincerely,
>>
>> DB Tsai
>> -------------------------------------------------------
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>

Re: Calling external classes added by sc.addJar needs to be through reflection

Posted by Mridul Muralidharan <mr...@gmail.com>.
Can you try moving your mapPartitions to another class/object which is
referenced only after sc.addJar ?

I would suspect CNFEx is coming while loading the class containing
mapPartitions before addJars is executed.

In general though, dynamic loading of classes means you use reflection to
instantiate it since expectation is you don't know which implementation
provides the interface ... If you statically know it apriori, you bundle it
in your classpath.

Regards
Mridul
On 17-May-2014 7:28 am, "DB Tsai" <db...@stanford.edu> wrote:

> Finally find a way out of the ClassLoader maze! It took me some times to
> understand how it works; I think it worths to document it in a separated
> thread.
>
> We're trying to add external utility.jar which contains CSVRecordParser,
> and we added the jar to executors through sc.addJar APIs.
>
> If the instance of CSVRecordParser is created without reflection, it
> raises *ClassNotFound
> Exception*.
>
> data.mapPartitions(lines => {
>     val csvParser = new CSVRecordParser((delimiter.charAt(0))
>     lines.foreach(line => {
>       val lineElems = csvParser.parseLine(line)
>     })
>     ...
>     ...
>  )
>
>
> If the instance of CSVRecordParser is created through reflection, it works.
>
> data.mapPartitions(lines => {
>     val loader = Thread.currentThread.getContextClassLoader
>     val CSVRecordParser =
>         loader.loadClass("com.alpine.hadoop.ext.CSVRecordParser")
>
>     val csvParser = CSVRecordParser.getConstructor(Character.TYPE)
>         .newInstance(delimiter.charAt(0).asInstanceOf[Character])
>
>     val parseLine = CSVRecordParser
>         .getDeclaredMethod("parseLine", classOf[String])
>
>     lines.foreach(line => {
>        val lineElems = parseLine.invoke(csvParser,
> line).asInstanceOf[Array[String]]
>     })
>     ...
>     ...
>  )
>
>
> This is identical to this question,
>
> http://stackoverflow.com/questions/7452411/thread-currentthread-setcontextclassloader-without-using-reflection
>
> It's not intuitive for users to load external classes through reflection,
> but couple available solutions including 1) messing around
> systemClassLoader by calling systemClassLoader.addURI through reflection or
> 2) forking another JVM to add jars into classpath before bootstrap loader
> are very tricky.
>
> Any thought on fixing it properly?
>
> @Xiangrui,
> netlib-java jniloader is loaded from netlib-java through reflection, so
> this problem will not be seen.
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>