Posted to user@spark.apache.org by Eric Kimbrel <le...@gmail.com> on 2014/01/13 22:29:51 UTC

yarn SPARK_CLASSPATH

Is there any extra trick required to use jars on the SPARK_CLASSPATH when running Spark on YARN?

I have several jars added to the SPARK_CLASSPATH in spark-env.sh. When my job runs I print the SPARK_CLASSPATH, so I can see that the jars were added to the environment the application master is running in. However, even though the jars are on the classpath, I continue to get class-not-found errors.

I have also tried setting SPARK_CLASSPATH via SPARK_YARN_USER_ENV.
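For context, the kind of spark-env.sh setup being described would look roughly like the sketch below; the jar names and paths are illustrative placeholders, not the poster's actual ones:

# spark-env.sh -- illustrative sketch only; real jar names/paths unknown
export SPARK_CLASSPATH="/usr/lib/hbase/lib/hbase-client.jar:/usr/lib/hbase/lib/hbase-common.jar:$SPARK_CLASSPATH"
# Forwarding the same value into the YARN containers was also tried:
export SPARK_YARN_USER_ENV="SPARK_CLASSPATH=$SPARK_CLASSPATH"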

Re: yarn SPARK_CLASSPATH

Posted by Eric K <le...@gmail.com>.
If I add the jars with the --addJars option, things start working. This is
kind of ugly because there are a lot of jars to add (HBase jars and their
dependencies). It looks like adding the jars to SPARK_CLASSPATH may
successfully add them to the workers, but doesn't add them to the
application master / driver program. Is there a way to add all the jars in
a directory to the application master?
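A minimal shell sketch of one possible workaround for the directory question, assuming all the needed jars live under a single directory (the HBASE_LIB path below is hypothetical):

# Sketch: turn a directory of jars into the comma-separated list that
# --addJars expects. HBASE_LIB is an illustrative path, not verified.
HBASE_LIB=/usr/lib/hbase/lib
ADD_JARS=$(echo "$HBASE_LIB"/*.jar | tr ' ' ',')
# then pass it to the YARN client, e.g. --addJars "$ADD_JARS"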




On Mon, Jan 13, 2014 at 1:57 PM, John Zhao <jz...@alpinenow.com> wrote:

> I have been facing the same problem. In my case, it turns out you don't
> need to set any classpath; the class-not-found exception was caused by the
> Hadoop version.
> I was trying to submit a Spark job to Hadoop 2.2.0 on YARN. After I did
> the following, it worked fine:
>
> SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true ./sbt/sbt clean assembly
>
> Hope this can help you.
>
> John.
>
>
> On Jan 13, 2014, at 1:29 PM, Eric Kimbrel <le...@gmail.com> wrote:
>
> Is there any extra trick required to use jars on the SPARK_CLASSPATH when
> running Spark on YARN?
>
> I have several jars added to the SPARK_CLASSPATH in spark-env.sh. When my
> job runs I print the SPARK_CLASSPATH, so I can see that the jars were added
> to the environment the application master is running in. However, even
> though the jars are on the classpath, I continue to get class-not-found
> errors.
>
> I have also tried setting SPARK_CLASSPATH via SPARK_YARN_USER_ENV.
>
>
>

Re: yarn SPARK_CLASSPATH

Posted by John Zhao <jz...@alpinenow.com>.
I have been facing the same problem. In my case, it turns out you don't need to set any classpath; the class-not-found exception was caused by the Hadoop version.
I was trying to submit a Spark job to Hadoop 2.2.0 on YARN. After I did the following, it worked fine:
SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true ./sbt/sbt clean assembly
Hope this can help you.

John.


On Jan 13, 2014, at 1:29 PM, Eric Kimbrel <le...@gmail.com> wrote:

> Is there any extra trick required to use jars on the SPARK_CLASSPATH when running Spark on YARN?
> 
> I have several jars added to the SPARK_CLASSPATH in spark-env.sh. When my job runs I print the SPARK_CLASSPATH, so I can see that the jars were added to the environment the application master is running in. However, even though the jars are on the classpath, I continue to get class-not-found errors.
> 
> I have also tried setting SPARK_CLASSPATH via SPARK_YARN_USER_ENV.


Re: yarn SPARK_CLASSPATH

Posted by Tom Graves <tg...@yahoo.com>.
The right way to set up YARN/Hadoop is tricky, as it really depends on how you use it.

Since HBase is a Hadoop service, you might just add it to yarn.application.classpath in your Hadoop config and have it on the classpath for all users/applications of that grid. In this way you are treating it the same way YARN picks up the HDFS jars. I'm not sure whether you have control over that, though. The risk of doing this is dependency conflicts or versioning issues. If you have other applications, such as MapReduce jobs, that use HBase, then it might make sense to do this.
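For illustration only (the framework entries vary by installation, and the HBase path is a placeholder), such an addition to yarn-site.xml might look like:

<property>
  <name>yarn.application.classpath</name>
  <!-- keep the existing framework entries, then append the HBase jars -->
  <value>$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/share/hadoop/common/*,$HADOOP_COMMON_HOME/share/hadoop/common/lib/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_YARN_HOME/share/hadoop/yarn/*,/usr/lib/hbase/lib/*</value>
</property>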

The other option is to modify the Spark-on-YARN code to add it to the classpath for you: either add whatever is in SPARK_CLASSPATH, or introduce a separate variable. Even though we wouldn't use it, I can see it being useful for some installations.

Tom



On Monday, January 13, 2014 5:58 PM, Eric K <le...@gmail.com> wrote:
 
Thanks for that extra insight about YARN. I am new to the whole YARN ecosystem, so I've been having trouble figuring out the right way to do some things. It sounds like, even though the jars are already installed on all the nodes of our cluster, I should just go ahead and add them with the --files method to simplify things and avoid having them added for all applications.

Thanks




On Mon, Jan 13, 2014 at 3:01 PM, Tom Graves <tg...@yahoo.com> wrote:

> I'm assuming you actually installed the jar on all the nodes of the YARN cluster then?
>
> In general this isn't a good idea on YARN, as most users don't have permissions to install things on the nodes themselves. The idea is that YARN provides a certain set of jars, which really should be just the YARN/Hadoop framework, and adds those to your classpath; the user provides everything else application-specific when they submit their application, and those files get distributed with the app and added to the classpath. If you are worried about a jar being downloaded every time, you can use the public distributed cache on YARN as a way to distribute and share it. It will only be removed from a node's distributed cache if other applications need that space.
>
> That said, what YARN adds to the classpath is configurable via the Hadoop configuration file yarn-site.xml, config name: yarn.application.classpath. So you can change the config to add it, but it will be added for all types of applications.
>
> You can use the --files and --archives options in yarn-standalone mode to use the distributed cache. To make a file public, make sure its permissions are set appropriately.
>
> Tom
>
> On Monday, January 13, 2014 3:49 PM, Eric Kimbrel <le...@gmail.com> wrote:
>
> Is there any extra trick required to use jars on the SPARK_CLASSPATH when running Spark on YARN?
>
> I have several jars added to the SPARK_CLASSPATH in spark-env.sh. When my job runs I print the SPARK_CLASSPATH, so I can see that the jars were added to the environment the application master is running in. However, even though the jars are on the classpath, I continue to get class-not-found errors.
>
> I have also tried setting SPARK_CLASSPATH via SPARK_YARN_USER_ENV.

Re: yarn SPARK_CLASSPATH

Posted by Eric K <le...@gmail.com>.
Thanks for that extra insight about YARN. I am new to the whole YARN
ecosystem, so I've been having trouble figuring out the right way to do
some things. It sounds like, even though the jars are already installed on
all the nodes of our cluster, I should just go ahead and add them with the
--files method to simplify things and avoid having them added for all
applications.

Thanks



On Mon, Jan 13, 2014 at 3:01 PM, Tom Graves <tg...@yahoo.com> wrote:

> I'm assuming you actually installed the jar on all the nodes of the YARN cluster then?
>
> In general this isn't a good idea on YARN, as most users don't have
> permissions to install things on the nodes themselves. The idea is that YARN
> provides a certain set of jars, which really should be just the YARN/Hadoop
> framework, and adds those to your classpath; the user provides everything
> else application-specific when they submit their application, and those
> files get distributed with the app and added to the classpath. If you are
> worried about a jar being downloaded every time, you can use the public
> distributed cache on YARN as a way to distribute and share it. It will only
> be removed from a node's distributed cache if other applications need that
> space.
>
> That said, what YARN adds to the classpath is configurable via the Hadoop
> configuration file yarn-site.xml, config name: yarn.application.classpath.
> So you can change the config to add it, but it will be added for all types
> of applications.
>
> You can use the --files and --archives options in yarn-standalone mode to
> use the distributed cache. To make a file public, make sure its permissions
> are set appropriately.
>
> Tom
>
>
>   On Monday, January 13, 2014 3:49 PM, Eric Kimbrel <le...@gmail.com>
> wrote:
> Is there any extra trick required to use jars on the SPARK_CLASSPATH
> when running Spark on YARN?
>
> I have several jars added to the SPARK_CLASSPATH in spark-env.sh. When my
> job runs I print the SPARK_CLASSPATH, so I can see that the jars were added
> to the environment the application master is running in. However, even
> though the jars are on the classpath, I continue to get class-not-found
> errors.
>
> I have also tried setting SPARK_CLASSPATH via SPARK_YARN_USER_ENV.
>
>

Re: yarn SPARK_CLASSPATH

Posted by Tom Graves <tg...@yahoo.com>.
I'm assuming you actually installed the jar on all the nodes of the YARN cluster then?

In general this isn't a good idea on YARN, as most users don't have permissions to install things on the nodes themselves. The idea is that YARN provides a certain set of jars, which really should be just the YARN/Hadoop framework, and adds those to your classpath; the user provides everything else application-specific when they submit their application, and those files get distributed with the app and added to the classpath. If you are worried about a jar being downloaded every time, you can use the public distributed cache on YARN as a way to distribute and share it. It will only be removed from a node's distributed cache if other applications need that space.

That said, what YARN adds to the classpath is configurable via the Hadoop configuration file yarn-site.xml, config name: yarn.application.classpath. So you can change the config to add it, but it will be added for all types of applications.

You can use the --files and --archives options in yarn-standalone mode to use the distributed cache. To make a file public, make sure its permissions are set appropriately.
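As an untested sketch of the public distributed cache route (every HDFS path here is hypothetical), one might stage a jar once and make it world-readable:

# Stage a jar on HDFS so YARN can serve it from the public distributed
# cache; world-readable permissions are what make it "public".
hadoop fs -mkdir -p /shared/jars
hadoop fs -put /usr/lib/hbase/lib/htrace-core-2.01.jar /shared/jars/
hadoop fs -chmod -R 755 /shared/jars
# Then reference it at submit time, e.g.:
#   --files hdfs:///shared/jars/htrace-core-2.01.jar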

Tom



On Monday, January 13, 2014 3:49 PM, Eric Kimbrel <le...@gmail.com> wrote:
 
Is there any extra trick required to use jars on the SPARK_CLASSPATH when running Spark on YARN?

I have several jars added to the SPARK_CLASSPATH in spark-env.sh. When my job runs I print the SPARK_CLASSPATH, so I can see that the jars were added to the environment the application master is running in. However, even though the jars are on the classpath, I continue to get class-not-found errors.

I have also tried setting SPARK_CLASSPATH via SPARK_YARN_USER_ENV.

Re: yarn SPARK_CLASSPATH

Posted by Eric K <le...@gmail.com>.
I am launching a job on a Spark cluster. To get the application to work I
have to run it like this:

HADOOP_CONF_DIR=$CONF \
SPARK_JAR=$SPARK_ASSEMBLY \
$SPARK/spark-class org.apache.spark.deploy.yarn.Client \
  --jar <path to my jar> \
  --class <my main class> \
  --args <my command line arguments> \
  --num-workers 7 \
  --master-memory 164g \
  --worker-memory 164g \
  --worker-cores 20 \
  --files file:///usr/lib/hbase/lib/hbase-common-0.95.2-cdh5.0.0-beta-1.jar,file:///usr/lib/hbase/lib/hbase-client-0.95.2-cdh5.0.0-beta-1.jar,file:///usr/lib/hbase/lib/hbase-protocol-0.95.2-cdh5.0.0-beta-1.jar,file:///usr/lib/hbase/lib/htrace-core-2.01.jar



What this ends up doing is taking all of the HBase jars (added with
--files), copying them into the distributed cache, and distributing them to
each container (at least that's how I understand it). What I would like to
do instead is simply place all of these jars on the SPARK_CLASSPATH, since
they are already available on every node of the cluster. However, even
when I place them on the Spark classpath, I get class-not-found errors.



On Mon, Jan 13, 2014 at 2:19 PM, Izhar ul Hassan <ez...@gmail.com> wrote:

> Eric, could you please provide a few more details? Log excerpts and/or the
> commands you are running would be helpful.
>
> /Izhar
>
>
> On Monday, January 13, 2014, Eric Kimbrel wrote:
>
>> Is there any extra trick required to use jars on the SPARK_CLASSPATH when
>> running Spark on YARN?
>>
>> I have several jars added to the SPARK_CLASSPATH in spark-env.sh. When
>> my job runs I print the SPARK_CLASSPATH, so I can see that the jars were
>> added to the environment the application master is running in. However,
>> even though the jars are on the classpath, I continue to get
>> class-not-found errors.
>>
>> I have also tried setting SPARK_CLASSPATH via SPARK_YARN_USER_ENV.
>
>
>
> --
> --
> /Izhar
>
>

Re: yarn SPARK_CLASSPATH

Posted by Izhar ul Hassan <ez...@gmail.com>.
Eric, could you please provide a few more details? Log excerpts and/or the
commands you are running would be helpful.

/Izhar

On Monday, January 13, 2014, Eric Kimbrel wrote:

> Is there any extra trick required to use jars on the SPARK_CLASSPATH when
> running Spark on YARN?
>
> I have several jars added to the SPARK_CLASSPATH in spark-env.sh. When my
> job runs I print the SPARK_CLASSPATH, so I can see that the jars were added
> to the environment the application master is running in. However, even
> though the jars are on the classpath, I continue to get class-not-found
> errors.
>
> I have also tried setting SPARK_CLASSPATH via SPARK_YARN_USER_ENV.



-- 
-- 
/Izhar