Posted to user@spark.apache.org by Greg Hill <gr...@RACKSPACE.COM> on 2014/09/02 17:06:19 UTC

Spark on YARN question

I'm working on setting up Spark on YARN using the HDP technical preview - http://hortonworks.com/kb/spark-1-0-1-technical-preview-hdp-2-1-3/

I have installed the Spark JARs on all the slave nodes and configured YARN to find the JARs.  It seems like everything is working.

Unless I'm misunderstanding, it seems like there isn't any configuration required on the YARN slave nodes at all, apart from telling YARN where to find the Spark JAR files.  Do the YARN processes even pick up local Spark configuration files on the slave nodes, or is that all just pulled in on the client and passed along to YARN?

Greg

Re: Spark on YARN question

Posted by "Dimension Data, LLC." <su...@didata.us>.
Hi Andrew:

Ah okay, thank you for clarifying (1) and (2)... (and even answering my
unwritten question about 'yarn-cluster', too). :)

I will definitely use the 'spark.yarn.jar' property (and stop using 
SPARK_JAR). Thanks.
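
For my own reference, a minimal sketch of that switch in
conf/spark-defaults.conf on the submitter node, reusing the HDFS location
from my earlier mail (the exact assembly file name below is just a
placeholder):

    spark.yarn.jar   hdfs://namenode:8020/path/to/spark-assembly-<version>.jar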

Finally this, from the --help output (my small addition being the
"(i.e. copy to)" part):

   --jars  Comma-separated list of local jars to include on
           (i.e. copy to) the driver and executor classpaths.

    I'm guessing that if proper permissions don't exist remotely for that
'copy', an exception will occur during the copy attempt? So care has to
be taken there.


Thank you again! =:)



On 09/02/2014 06:36 PM, Andrew Or wrote:
> Hi Didata,
>
> (1) Correct. The default deploy mode is `client`, so both masters 
> `yarn` and `yarn-client` run Spark in client mode. If you explicitly 
> specify master as `yarn-cluster`, Spark will run in cluster mode. If 
> you implicitly specify one deploy mode through the master (e.g. 
> yarn-client) but set deploy mode to the opposite (e.g. cluster), Spark 
> will complain and throw an exception. :)
>
> (2) The jars passed through the `--jars` option only need to be 
> visible to the spark-submit program. Depending on the deploy mode, 
> they will be propagated to the containers (i.e. the executors, and the 
> driver in cluster mode) differently so you don't need to manually copy 
> them yourself, either through rsync'ing or uploading to HDFS. Another 
> thing is that "SPARK_JAR" is technically deprecated (you should get a 
> warning for using it). Instead, you can set "spark.yarn.jar" in your 
> conf/spark-defaults.conf on the submitter node.
>
> Let me know if you have more questions,
> -Andrew
>
>
> 2014-09-02 15:12 GMT-07:00 Dimension Data, LLC.
> <subscriptions@didata.us>:
>
>     Hello friends:
>
>     I have a follow-up to Andrew's well articulated answer below
>     (thank you for that).
>
>     (1) I've seen both of these invocations in various places:
>
>           (a) '--master yarn'
>           (b) '--master yarn-client'
>
>         the latter of which doesn't appear in
>     'pyspark|spark-submit|spark-shell --help' output.
>
>         Is case (a) meant for cluster-mode apps (where the driver is
>     out on a YARN ApplicationMaster),
>         and case (b) for client-mode apps needing client interaction
>     locally?
>
>         Also (related), is case (b) simply shorthand for the following
>     invocation syntax?
>            '--master yarn --deploy-mode client'
>
>     (2) Seeking clarification on the first sentence below...
>
>         Note: To avoid a copy of the Assembly JAR every time I launch
>     a job, I place it (the latest version) at a specific (but otherwise
>     arbitrary) location on HDFS, and then set SPARK_JAR, like so (where
>     you can thankfully use wild-cards):
>
>        export SPARK_JAR=hdfs://namenode:8020/path/to/spark-assembly-*.jar
>
>         But my question here is, when specifying additional JARS like
>     this '--jars /path/to/jar1,/path/to/jar2,...'
>     to pyspark|spark-submit|spark-shell commands, are those JARS
>     expected to *already* be
>         at those path locations on both the _submitter_ server, as
>     well as on YARN _worker_ servers?
>
>         In other words, the '--jars' option won't cause the command to
>     look for them locally at those path
>         locations, and then ship & place them to the same
>     path-locations remotely? They need to be there
>         already, both locally and remotely. Correct?
>
>     Thank you. :)
>     didata
>
>
>     On 09/02/2014 12:05 PM, Andrew Or wrote:
>>     Hi Greg,
>>
>>     You should not need to even manually install Spark on each of the
>>     worker nodes or put it into HDFS yourself. Spark on Yarn will
>>     ship all necessary jars (i.e. the assembly + additional jars) to
>>     each of the containers for you. You can specify additional jars
>>     that your application depends on through the --jars argument if
>>     you are using spark-submit / spark-shell / pyspark. As for
>>     environment variables, you can specify SPARK_YARN_USER_ENV on the
>>     driver node (where your application is submitted) to specify
>>     environment variables to be observed by your executors. If you
>>     are using the spark-submit / spark-shell / pyspark scripts, then
>>     you can set Spark properties in the conf/spark-defaults.conf
>>     properties file, and these will be propagated to the executors.
>>     In other words, configurations on the slave nodes don't do anything.
>>
>>     For example,
>>     $ vim conf/spark-defaults.conf // set a few properties
>>     $ export SPARK_YARN_USER_ENV=YARN_LOCAL_DIR=/mnt,/mnt2
>>     $ bin/spark-shell --master yarn --jars
>>     /local/path/to/my/jar1,/another/jar2
>>
>>     Best,
>>     -Andrew
>
>

-- 
Sincerely yours,
Team Dimension Data
Dimension Data, LLC. | https://www.didata.us
P: 212.882.1276 | subscriptions@didata.us
Follow us: https://www.LinkedIn.com/company/didata
Data Analytics you can literally count on.


Re: Spark on YARN question

Posted by Andrew Or <an...@databricks.com>.
Hi Didata,

(1) Correct. The default deploy mode is `client`, so both masters `yarn`
and `yarn-client` run Spark in client mode. If you explicitly specify
master as `yarn-cluster`, Spark will run in cluster mode. If you implicitly
specify one deploy mode through the master (e.g. yarn-client) but set
deploy mode to the opposite (e.g. cluster), Spark will complain and throw
an exception. :)
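
Concretely, these pairs are equivalent (the application class and jar
names below are just placeholders):

$ bin/spark-submit --master yarn --deploy-mode client --class com.example.MyApp my-app.jar
$ bin/spark-submit --master yarn-client --class com.example.MyApp my-app.jar      # same as above

$ bin/spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar
$ bin/spark-submit --master yarn-cluster --class com.example.MyApp my-app.jar     # same as above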

(2) The jars passed through the `--jars` option only need to be visible to
the spark-submit program. Depending on the deploy mode, they will be
propagated to the containers (i.e. the executors, and the driver in cluster
mode) differently so you don't need to manually copy them yourself, either
through rsync'ing or uploading to HDFS. Another thing is that "SPARK_JAR"
is technically deprecated (you should get a warning for using it). Instead,
you can set "spark.yarn.jar" in your conf/spark-defaults.conf on the
submitter node.
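
For example (the jar paths and application names below are placeholders),
the following only requires jar1 and jar2 to exist on the machine running
spark-submit; Spark ships them to the containers:

$ bin/spark-submit --master yarn \
    --jars /local/path/to/jar1.jar,/local/path/to/jar2.jar \
    --class com.example.MyApp my-app.jar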

Let me know if you have more questions,
-Andrew


2014-09-02 15:12 GMT-07:00 Dimension Data, LLC. <su...@didata.us>:

>  Hello friends:
>
> I have a follow-up to Andrew's well articulated answer below (thank you
> for that).
>
> (1) I've seen both of these invocations in various places:
>
>       (a) '--master yarn'
>       (b) '--master yarn-client'
>
>     the latter of which doesn't appear in 'pyspark|spark-submit|spark-shell
> --help' output.
>
>     Is case (a) meant for cluster-mode apps (where the driver is out on a
> YARN ApplicationMaster),
>     and case (b) for client-mode apps needing client interaction locally?
>
>     Also (related), is case (b) simply shorthand for the following
> invocation syntax?
>        '--master yarn --deploy-mode client'
>
> (2) Seeking clarification on the first sentence below...
>
>     Note: To avoid a copy of the Assembly JAR every time I launch a job,
> I place it (the latest version) at a specific (but otherwise arbitrary)
> location on HDFS, and then set SPARK_JAR, like so (where you can
> thankfully use wild-cards):
>
>        export SPARK_JAR=hdfs://namenode:8020/path/to/spark-assembly-*.jar
>
>     But my question here is, when specifying additional JARS like this
> '--jars /path/to/jar1,/path/to/jar2,...'
>     to pyspark|spark-submit|spark-shell commands, are those JARS
> expected to *already* be
>     at those path locations on both the _submitter_ server, as well as on
> YARN _worker_ servers?
>
>     In other words, the '--jars' option won't cause the command to look
> for them locally at those path
>     locations, and then ship & place them to the same path-locations
> remotely? They need to be there
>     already, both locally and remotely. Correct?
>
> Thank you. :)
> didata
>
>
>  On 09/02/2014 12:05 PM, Andrew Or wrote:
>
> Hi Greg,
>
>  You should not need to even manually install Spark on each of the worker
> nodes or put it into HDFS yourself. Spark on Yarn will ship all necessary
> jars (i.e. the assembly + additional jars) to each of the containers for
> you. You can specify additional jars that your application depends on
> through the --jars argument if you are using spark-submit / spark-shell /
> pyspark. As for environment variables, you can specify SPARK_YARN_USER_ENV
> on the driver node (where your application is submitted) to specify
> environment variables to be observed by your executors. If you are using
> the spark-submit / spark-shell / pyspark scripts, then you can set Spark
> properties in the conf/spark-defaults.conf properties file, and these will
> be propagated to the executors. In other words, configurations on the slave
> nodes don't do anything.
>
>  For example,
> $ vim conf/spark-defaults.conf // set a few properties
> $ export SPARK_YARN_USER_ENV=YARN_LOCAL_DIR=/mnt,/mnt2
> $ bin/spark-shell --master yarn --jars /local/path/to/my/jar1,/another/jar2
>
>  Best,
> -Andrew
>
>

Re: Spark on YARN question

Posted by "Dimension Data, LLC." <su...@didata.us>.
Hello friends:

I have a follow-up to Andrew's well articulated answer below (thank you 
for that).

(1) I've seen both of these invocations in various places:

       (a) '--master yarn'
       (b) '--master yarn-client'

     the latter of which doesn't appear in
'pyspark|spark-submit|spark-shell --help' output.

     Is case (a) meant for cluster-mode apps (where the driver is out on
a YARN ApplicationMaster),
     and case (b) for client-mode apps needing client interaction locally?

     Also (related), is case (b) simply shorthand for the following 
invocation syntax?
        '--master yarn --deploy-mode client'

(2) Seeking clarification on the first sentence below...

     Note: To avoid a copy of the Assembly JAR every time I launch a
job, I place it (the latest version) at a specific (but otherwise
arbitrary) location on HDFS, and then set SPARK_JAR, like so (where you
can thankfully use wild-cards):

       export SPARK_JAR=hdfs://namenode:8020/path/to/spark-assembly-*.jar

     But my question here is, when specifying additional JARS like this 
'--jars /path/to/jar1,/path/to/jar2,...'
     to pyspark|spark-submit|spark-shell commands, are those JARS
expected to *already* be
     at those path locations on both the _submitter_ server, as well as 
on YARN _worker_ servers?

     In other words, the '--jars' option won't cause the command to look 
for them locally at those path
     locations, and then ship & place them to the same path-locations 
remotely? They need to be there
     already, both locally and remotely. Correct?

Thank you. :)
didata


On 09/02/2014 12:05 PM, Andrew Or wrote:
> Hi Greg,
>
> You should not need to even manually install Spark on each of the 
> worker nodes or put it into HDFS yourself. Spark on Yarn will ship all 
> necessary jars (i.e. the assembly + additional jars) to each of the 
> containers for you. You can specify additional jars that your 
> application depends on through the --jars argument if you are using 
> spark-submit / spark-shell / pyspark. As for environment variables, 
> you can specify SPARK_YARN_USER_ENV on the driver node (where your 
> application is submitted) to specify environment variables to be 
> observed by your executors. If you are using the spark-submit / 
> spark-shell / pyspark scripts, then you can set Spark properties in 
> the conf/spark-defaults.conf properties file, and these will be 
> propagated to the executors. In other words, configurations on the 
> slave nodes don't do anything.
>
> For example,
> $ vim conf/spark-defaults.conf // set a few properties
> $ export SPARK_YARN_USER_ENV=YARN_LOCAL_DIR=/mnt,/mnt2
> $ bin/spark-shell --master yarn --jars 
> /local/path/to/my/jar1,/another/jar2
>
> Best,
> -Andrew

Re: Spark on YARN question

Posted by Greg Hill <gr...@RACKSPACE.COM>.
Thanks.  That sounds like how I was thinking it worked.  I did have to install the JARs on the slave nodes for yarn-cluster mode to work, FWIW.  It's probably just whichever node ends up spawning the application master that needs it, but it wasn't passed along from spark-submit.

Greg

From: Andrew Or <an...@databricks.com>
Date: Tuesday, September 2, 2014 11:05 AM
To: Matt Narrell <ma...@gmail.com>
Cc: Greg <gr...@rackspace.com>, "user@spark.apache.org" <us...@spark.apache.org>
Subject: Re: Spark on YARN question

Hi Greg,

You should not need to even manually install Spark on each of the worker nodes or put it into HDFS yourself. Spark on Yarn will ship all necessary jars (i.e. the assembly + additional jars) to each of the containers for you. You can specify additional jars that your application depends on through the --jars argument if you are using spark-submit / spark-shell / pyspark. As for environment variables, you can specify SPARK_YARN_USER_ENV on the driver node (where your application is submitted) to specify environment variables to be observed by your executors. If you are using the spark-submit / spark-shell / pyspark scripts, then you can set Spark properties in the conf/spark-defaults.conf properties file, and these will be propagated to the executors. In other words, configurations on the slave nodes don't do anything.

For example,
$ vim conf/spark-defaults.conf // set a few properties
$ export SPARK_YARN_USER_ENV=YARN_LOCAL_DIR=/mnt,/mnt2
$ bin/spark-shell --master yarn --jars /local/path/to/my/jar1,/another/jar2

Best,
-Andrew

Re: Spark on YARN question

Posted by Andrew Or <an...@databricks.com>.
Hi Greg,

You should not need to even manually install Spark on each of the worker
nodes or put it into HDFS yourself. Spark on Yarn will ship all necessary
jars (i.e. the assembly + additional jars) to each of the containers for
you. You can specify additional jars that your application depends on
through the --jars argument if you are using spark-submit / spark-shell /
pyspark. As for environment variables, you can specify SPARK_YARN_USER_ENV
on the driver node (where your application is submitted) to specify
environment variables to be observed by your executors. If you are using
the spark-submit / spark-shell / pyspark scripts, then you can set Spark
properties in the conf/spark-defaults.conf properties file, and these will
be propagated to the executors. In other words, configurations on the slave
nodes don't do anything.

For example,
$ vim conf/spark-defaults.conf // set a few properties
$ export SPARK_YARN_USER_ENV=YARN_LOCAL_DIR=/mnt,/mnt2
$ bin/spark-shell --master yarn --jars /local/path/to/my/jar1,/another/jar2

Best,
-Andrew

Re: Spark on YARN question

Posted by Matt Narrell <ma...@gmail.com>.
I’ve put my Spark JAR into HDFS, and specify the SPARK_JAR variable to point to the HDFS location of the jar.  I’m not using any specialized configuration files (like spark-env.sh), but rather setting things either by environment variable per node, passing application arguments to the job, or making a Zookeeper connection from my job to seed properties.  From there, I can construct a SparkConf as necessary.
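
A rough sketch of the kind of invocation I mean (the paths, job file, and
application arguments below are just illustrative):

# Assembly lives on HDFS, so nothing Spark-specific needs installing per node.
$ export SPARK_JAR=hdfs://namenode:8020/shared/spark-assembly-<version>.jar

# Anything else the job needs arrives as application arguments, e.g. a
# ZooKeeper connect string it uses to look up further properties.
$ bin/spark-submit --master yarn my_job.py --zk-connect zk1:2181,zk2:2181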

mn

On Sep 2, 2014, at 9:06 AM, Greg Hill <gr...@RACKSPACE.COM> wrote:

> I'm working on setting up Spark on YARN using the HDP technical preview - http://hortonworks.com/kb/spark-1-0-1-technical-preview-hdp-2-1-3/
> 
> I have installed the Spark JARs on all the slave nodes and configured YARN to find the JARs.  It seems like everything is working.
> 
> Unless I'm misunderstanding, it seems like there isn't any configuration required on the YARN slave nodes at all, apart from telling YARN where to find the Spark JAR files.  Do the YARN processes even pick up local Spark configuration files on the slave nodes, or is that all just pulled in on the client and passed along to YARN?
> 
> Greg