Posted to user@spark.apache.org by "Dimension Data, LLC." <su...@didata.us> on 2014/09/08 18:35:21 UTC

If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

Hello friends:

It was mentioned in another (YARN-centric) email thread that 'SPARK_JAR' 
was deprecated, and that the 'spark.yarn.jar' property should be used 
instead for YARN submission.
For example:

    user$ pyspark [some-options] --driver-java-options spark.yarn.jar=hdfs://namenode:8020/path/to/spark-assembly-*.jar

What is the equivalent property to use for the LOCAL MODE case? 
spark.jar? spark.local.jar?
I searched for this, but can't find where these properties are defined 
(perhaps a pointer to that, too). :)

For completeness/explicitness, I like to specify things like this on the 
CLI, even if there are default settings for them.

Thank you!
didata




Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

Posted by Marcelo Vanzin <va...@cloudera.com>.
Yes, that's how file: URLs are interpreted everywhere in Spark. (It's also
explained in the link to the docs I posted earlier.)

The second interpretation below corresponds to "local:" URLs in Spark, but
those don't work with Yarn in Spark 1.0 (so they won't work with CDH 5.1
and older either).
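To make the two interpretations concrete, here is a minimal sketch; the
jar paths are illustrative, not taken from any particular deployment:

```shell
# A "file:" URL names a jar on the submitting machine; Spark uploads it
# to the application's HDFS staging area before launch:
UPLOAD_URL="file:///usr/lib/spark/assembly/lib/spark-assembly.jar"

# A "local:" URL names a path expected to already exist on every
# NodeManager; nothing is uploaded (not supported by Yarn in Spark 1.0):
NO_UPLOAD_URL="local:/usr/lib/spark/assembly/lib/spark-assembly.jar"

echo "$UPLOAD_URL"
echo "$NO_UPLOAD_URL"
```

Either value would be passed via the jar property (or SPARK_JAR); only the
URL scheme changes the behavior.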

On Mon, Sep 8, 2014 at 6:00 PM, Dimension Data, LLC. <
subscriptions@didata.us> wrote:

>  Even when using 'file:///...' nomenclature in SPARK_JAR (instead of
> through the yet-to-be-implemented 'spark.yarn.jar'
> property), its interpretation still seems to be:
>
>    'Tell me where the local spark jar is located so that I can upload it
> (i.e. hdfs dfs -put) to an HDFS staging area for you'.
>               -as opposed to-
>    'Tell me where the local spark jar is located on the NMs, and I will
> look for it at that UNIX path'.
>

-- 
Marcelo

Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

Posted by "Dimension Data, LLC." <su...@didata.us>.
Thank you.

Just a note:

Even when using 'file:///...' nomenclature in SPARK_JAR (instead of 
through the yet-to-be-implemented 'spark.yarn.jar'
property), its interpretation still seems to be:

    'Tell me where the local spark jar is located so that I can upload 
it (i.e. hdfs dfs -put) to an HDFS staging area for you'.
               -as opposed to-
    'Tell me where the local spark jar is located on the NMs, and I will 
look for it at that UNIX path'.

Seems to be the first case for both 'SPARK_JAR' and for 'spark.yarn.jar'.

Thank you for the CDH5.2 tip. :)

On 09/08/2014 07:41 PM, Marcelo Vanzin wrote:
>
> On Mon, Sep 8, 2014 at 3:54 PM, Dimension Data, LLC. 
> <subscriptions@didata.us <ma...@didata.us>> wrote:
>
>     You're probably right about the above because, as seen *below* for
>     pyspark (but probably for other Spark
>     applications too), once
>     '-Dspark.master=[yarn-client|yarn-cluster]' is specified, the app
>     invocation doesn't even seem to
>     respect the property '-Dspark.yarn.jar=[file:/// | file:// |/...]'
>     setting.
>
>     That's too bad because I have CDH5 Spark installed
>
>
> If you're using CDH 5 (either 5.0 or 5.1) then spark.yarn.jar won't 
> work. For CDH5 you want to set the SPARK_JAR environment variable instead.
>
> The change that added spark.yarn.jar is part of Spark 1.1 (which will 
> be part of CDH 5.2).
>
> -- 
> Marcelo

-- 
Dimension Data, LLC.
Sincerely yours,
Team Dimension Data
------------------------------------------------------------------------
Dimension Data, LLC. <https://www.didata.us> | www.didata.us
P: 212.882.1276 | subscriptions@didata.us
Follow Us: https://www.LinkedIn.com/company/didata

Dimension Data, LLC. <http://www.didata.us>
Data Analytics you can literally count on.


Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

Posted by Marcelo Vanzin <va...@cloudera.com>.
On Mon, Sep 8, 2014 at 3:54 PM, Dimension Data, LLC. <
subscriptions@didata.us> wrote:

>  You're probably right about the above because, as seen *below* for
> pyspark (but probably for other Spark
> applications too), once '-Dspark.master=[yarn-client|yarn-cluster]' is
> specified, the app invocation doesn't even seem to
> respect the property '-Dspark.yarn.jar=[file:/// | file:// |/...]'
> setting.
>
> That's too bad because I have CDH5 Spark installed
>

If you're using CDH 5 (either 5.0 or 5.1) then spark.yarn.jar won't work.
For CDH5 you want to set the SPARK_JAR environment variable instead.

The change that added spark.yarn.jar is part of Spark 1.1 (which will be
part of CDH 5.2).

-- 
Marcelo

Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

Posted by "Dimension Data, LLC." <su...@didata.us>.
 >> That's correct. In fact, I'm not aware of Yarn working at all without
 >> the HDFS configuration being in place (even if the default fs is not
 >> HDFS), but then I'm not a Yarn deployment expert.

Thanks for the clarifications Marcelo.

You're probably right about the above because, as seen *below* for 
pyspark (but probably for other Spark
applications too), once '-Dspark.master=[yarn-client|yarn-cluster]' is 
specified, the app invocation doesn't even seem to
respect the property '-Dspark.yarn.jar=[file:/// | file:// |/...]' setting.

That's too bad because I have CDH5 Spark installed (and identically 
configured) across all NMs, so there is no need to grab it from HDFS 
(... meaning, each has it locally available here: 
'/usr/lib/spark/assembly/lib/spark-assembly-*.jar').


14/09/08 18:36:20 INFO Client: Setting up container launch context
14/09/08 18:36:20 INFO Client: Command for starting the Spark ApplicationMaster:
List($JAVA_HOME/bin/java, -server, -Xmx1024m, -Djava.io.tmpdir=$PWD/tmp,
 -Dspark.tachyonStore.folderName=\"spark-7cca3c72-5d81-4024-a0ca-e0b83bf2866c\",
 -Dspark.yarn.jar=\"file:///usr/lib/spark/assembly/lib/spark-assembly-*.jar\",
 -Dspark.yarn.secondary.jars=\"\", -Dspark.submit.pyFiles=\"\",
 -Dspark.driver.host=\"g750asus\", -Dspark.driver.appUIHistoryAddress=\"\",
 -Dspark.app.name=\"PySparkShell\",
 -Dspark.fileserver.uri=\"http://192.168.0.15:50452\",
 -Dspark.master=\"yarn-client\", -Dspark.driver.port=\"34770\",
 -Dspark.httpBroadcast.uri=\"http://192.168.0.15:59074\",
 -Dlog4j.configuration=log4j-spark-container.properties,
 org.apache.spark.deploy.yarn.ExecutorLauncher, --class, notused, --jar, null,
 --args 'g750asus:34770', --executor-memory, 512, --executor-cores, 4,
 --num-executors, 4, 1>, <LOG_DIR>/stdout, 2>, <LOG_DIR>/stderr)


Thanks again! :)




On 09/08/2014 03:46 PM, Marcelo Vanzin wrote:
>
> On Mon, Sep 8, 2014 at 11:52 AM, Dimension Data, LLC. 
> <subscriptions@didata.us <ma...@didata.us>> wrote:
>
>     So just to clarify for me: When specifying 'spark.yarn.jar' as I
>     did above, even if I don't use HDFS to create a
>     RDD (e.g. do something simple like: 'sc.parallelize(range(100))'),
>     it is still necessary to configure the HDFS
>     location in each NM's '/etc/hadoop/conf/*', just so that they can
>     access the Spark Jar in the YARN case?
>
>
> That's correct. In fact, I'm not aware of Yarn working at all without 
> the HDFS configuration being in place (even if the default fs is not 
> HDFS), but then I'm not a Yarn deployment expert.
>
> -- 
> Marcelo



Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

Posted by Marcelo Vanzin <va...@cloudera.com>.
On Mon, Sep 8, 2014 at 11:52 AM, Dimension Data, LLC. <
subscriptions@didata.us> wrote:

>  So just to clarify for me: When specifying 'spark.yarn.jar' as I did
> above, even if I don't use HDFS to create a
> RDD (e.g. do something simple like: 'sc.parallelize(range(100))'), it is
> still necessary to configure the HDFS
> location in each NM's '/etc/hadoop/conf/*', just so that they can access
> the Spark Jar in the YARN case?
>

That's correct. In fact, I'm not aware of Yarn working at all without the
HDFS configuration being in place (even if the default fs is not HDFS), but
then I'm not a Yarn deployment expert.
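As a sketch of what "the HDFS configuration being in place" means in
practice: a minimal core-site.xml fragment like the one below is what lets
a NodeManager resolve hdfs:// URIs such as the one in spark.yarn.jar. The
namenode host and port here are illustrative, not taken from this thread:

```xml
<?xml version="1.0"?>
<!-- Minimal /etc/hadoop/conf/core-site.xml fragment (illustrative).
     fs.defaultFS tells Hadoop clients which filesystem backs
     unqualified and hdfs:// paths. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode:8020</value>
  </property>
</configuration>
```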

-- 
Marcelo

Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

Posted by "Dimension Data, LLC." <su...@didata.us>.
Ah okay. I understand now Marcelo (thank you).

Long ago I did notice that when specifying 'export 
MASTER=yarn-client|cluster' (or, equivalently, --master=yarn-client|cluster)
that (as you pointed out) an upload of the assembly JAR to HDFS is 
initiated each time a job is started; and so I simply
upload the latest one to a well-known constant HDFS path, and inform the 
client of that HDFS path via the option:

      '... -Dspark.yarn.jar=hdfs://path/to/spark/assemblyJar'.
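The workflow described above can be sketched as follows; the namenode
host, port, and HDFS path are illustrative, not authoritative:

```shell
# Well-known, constant HDFS location for the assembly jar (illustrative).
JAR_HDFS_PATH="hdfs://namenode:8020/user/spark/share/lib/spark-assembly.jar"

# One-time step (per Spark upgrade): push the local jar up to HDFS.
# hdfs dfs -put /usr/lib/spark/assembly/lib/spark-assembly.jar "$JAR_HDFS_PATH"

# Every subsequent submission references the cached copy instead of
# re-uploading the jar:
OPT="-Dspark.yarn.jar=$JAR_HDFS_PATH"
# pyspark [someOptions] --driver-java-options "$OPT"
echo "$OPT"
```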

Interestingly, I never thought about the following until you just 
mentioned something ...
I configure my NodeManagers to be able to connect to HDFS, when needed 
-- say, in case a dataset needs to be
formed from assets existing in HDFS. But I didn't realize that NMs also 
need the HDFS configuration to be in place
(which, for CDH5 is here: '/etc/hadoop/conf/*.xml') just to be able to 
get the Spark Assembly JAR, too (because I
thought it was somehow shipped to NMs as part of Job launch).

So just to clarify for me: When specifying 'spark.yarn.jar' as I did 
above, even if I don't use HDFS to create a
RDD (e.g. do something simple like: 'sc.parallelize(range(100))'), it is 
still necessary to configure the HDFS
location in each NM's '/etc/hadoop/conf/*', just so that they can access 
the Spark Jar in the YARN case?
(... again, b/c the Spark Jar isn't magically uploaded to them from the 
local server from which the job is
launched, which is what I had thought).


I hope I wrote that out well. =:)

Thanks again!
nmvega


On 09/08/2014 01:36 PM, Marcelo Vanzin wrote:
>
> On Mon, Sep 8, 2014 at 10:00 AM, Dimension Data, LLC. 
> <subscriptions@didata.us <ma...@didata.us>> wrote:
>
>     user$ export MASTER=local[nn] # Run spark shell on LOCAL CPU threads.
>     user$ pyspark [someOptions] --driver-java-options
>     -Dspark.*XYZ*.jar='/usr/lib/spark/assembly/lib/spark-assembly-*.jar'
>
>     My question is, what to replace '*XYZ*' with in that case.
>
>
> Ah, I see. There's no equivalent to that option in non-yarn mode, 
> because either there is no need (e.g. in local mode everything is in 
> the same machine) or the cluster backend doesn't support the feature 
> (e.g. Spark Master does not have a distributed file cache like Yarn, 
> at least as far as I know).
>
> That option is just to tell Yarn to use the distributed cache for the 
> spark jar instead of auto-detecting where the jar is (which would 
> incur in uploading the spark jar to all involved NMs).
>
> -- 
> Marcelo



Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

Posted by Marcelo Vanzin <va...@cloudera.com>.
On Mon, Sep 8, 2014 at 10:00 AM, Dimension Data, LLC. <
subscriptions@didata.us> wrote:

>  user$ export MASTER=local[nn] # Run spark shell on LOCAL CPU threads.
> user$ pyspark [someOptions] --driver-java-options -Dspark.*XYZ*.jar='
> /usr/lib/spark/assembly/lib/spark-assembly-*.jar'
>
> My question is, what to replace '*XYZ*' with in that case.
>

Ah, I see. There's no equivalent to that option in non-yarn mode, because
either there is no need (e.g. in local mode everything is in the same
machine) or the cluster backend doesn't support the feature (e.g. Spark
Master does not have a distributed file cache like Yarn, at least as far as
I know).

That option is just to tell Yarn to use the distributed cache for the spark
jar instead of auto-detecting where the jar is (which would incur in
uploading the spark jar to all involved NMs).
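In other words, the answer to the thread's question is that local mode
needs no jar property at all, since the assembly is already on the
driver's classpath. A minimal sketch (the MASTER value is illustrative):

```shell
# Local mode: everything runs in one JVM on this machine, so there is
# no spark.*.jar equivalent to set -- the assembly jar is simply on the
# driver's classpath.
export MASTER="local[4]"
# pyspark [someOptions]    # no -Dspark.yarn.jar (or equivalent) required
echo "$MASTER"
```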

-- 
Marcelo

Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

Posted by "Dimension Data, LLC." <su...@didata.us>.
Hi Marcelo:

Thanks for the 'Advanced Dependency Management' URL (I'll definitely have 
a read).

Indeed I do use a '-D' (sorry for the paste error; nice catch, though).

Basically, I'm looking for this:
user$ export MASTER=yarn-client # Run spark shell on YARN.
user$ pyspark [someOptions] --driver-java-options -Dspark.yarn.jar='hdfs://namenode:8020/path/to/spark-assembly-*.jar'

         -OR-

user$ export MASTER=local[nn] # Run spark shell on LOCAL CPU threads.
user$ pyspark [someOptions] --driver-java-options -Dspark.*XYZ*.jar='/usr/lib/spark/assembly/lib/spark-assembly-*.jar'

My question is, what to replace '*XYZ*' with in that case.


Thank you.


On 09/08/2014 12:41 PM, Marcelo Vanzin wrote:
> On Mon, Sep 8, 2014 at 9:35 AM, Dimension Data, LLC.
> <su...@didata.us> wrote:
>>     user$ pyspark [some-options] --driver-java-options
>> spark.yarn.jar=hdfs://namenode:8020/path/to/spark-assembly-*.jar
> This command line does not look correct. "spark.yarn.jar" is not a JVM
> command line option. You most probably need a "-D" before that.
>
>> What is the equivalent property to use for the LOCAL MODE case?
> What do you mean local mode? --master local? A local file?
>
> That location is a URL, so set it to what makes sense in your case. See also:
> http://spark.apache.org/docs/latest/submitting-applications.html
>
> (Under "Advanced Dependency Management".)
>



Re: If for YARN you use 'spark.yarn.jar', what is the LOCAL equivalent to that property ...

Posted by Marcelo Vanzin <va...@cloudera.com>.
On Mon, Sep 8, 2014 at 9:35 AM, Dimension Data, LLC.
<su...@didata.us> wrote:
>    user$ pyspark [some-options] --driver-java-options
> spark.yarn.jar=hdfs://namenode:8020/path/to/spark-assembly-*.jar

This command line does not look correct. "spark.yarn.jar" is not a JVM
command line option. You most probably need a "-D" before that.

> What is the equivalent property to use for the LOCAL MODE case?

What do you mean local mode? --master local? A local file?

That location is a URL, so set it to what makes sense in your case. See also:
http://spark.apache.org/docs/latest/submitting-applications.html

(Under "Advanced Dependency Management".)
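Putting the "-D" correction above into a concrete form (the namenode
host, port, and path are carried over from the question and are
illustrative):

```shell
# spark.yarn.jar is a Spark configuration property, not a JVM flag on its
# own, so it must be passed as a system property with a leading -D:
DRIVER_OPTS="-Dspark.yarn.jar=hdfs://namenode:8020/path/to/spark-assembly.jar"
# pyspark [some-options] --driver-java-options "$DRIVER_OPTS"
echo "$DRIVER_OPTS"
```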

-- 
Marcelo

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org