Posted to user@spark.apache.org by "Dimension Data, LLC." <su...@didata.us> on 2014/09/17 00:22:33 UTC

Problem with pyspark command line invocation -- option truncation... (Spark v1.1.0) ...


Hello friends:

Yesterday I compiled Spark 1.1.0 against CDH5's Hadoop/YARN 
distribution. Everything went fine, and everything seems
to work, except for the following.

Below are two invocations of the 'pyspark' script: one without 
enclosing quotes around the options passed to
'--driver-java-options', and one with them. I added the following 
one-liner to the 'pyspark' script to
show my problem...

ADDED: echo "xxx${PYSPARK_SUBMIT_ARGS}xxx" # Added after the line that 
exports this variable.
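
For context, the added line sits right after the export in the 'pyspark' 
script, roughly like this (a sketch from memory of the 1.1 script; the 
surrounding lines may differ):

    export PYSPARK_SUBMIT_ARGS                # built from the command-line arguments
    echo "xxx${PYSPARK_SUBMIT_ARGS}xxx"       # ADDED: show exactly what will be submitted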

=========================================================

FIRST:
[ without enclosing quotes ]:

user@linux$ pyspark --master yarn-client --driver-java-options 
-Dspark.executor.memory=1G -Dspark.ui.port=8468 
-Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M 
-Dspark.executor.instances=3 
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar
xxx --master yarn-client --driver-java-options 
-Dspark.executor.memory=1Gxxx  <--- the echo statement shows the option truncation.

While this succeeds in getting to a pyspark shell prompt (with an sc 
context), the context isn't set up properly because, as the echo output
above and the launch command below show, only the first option took 
effect. (Note: spark.executor.memory looks correct, but only because
my spark defaults happen to coincide with it.)
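
I assume the truncation is ordinary shell word-splitting: without quotes, 
each -D token reaches 'pyspark' as a separate argument, and the option 
parser binds only the very next token to '--driver-java-options'. A 
minimal illustration (hypothetical option values):

    user@linux$ set -- --driver-java-options -Da=1 -Db=2
    user@linux$ echo "$2"   # prints -Da=1; -Db=2 is left over as a separate argument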

14/09/16 17:35:32 INFO yarn.Client:   command: $JAVA_HOME/bin/java 
-server -Xmx512m -Djava.io.tmpdir=$PWD/tmp 
'-Dspark.tachyonStore.folderName=spark-e225c04d-5333-4ca6-9a78-1c3392438d89' 
'-Dspark.serializer.objectStreamReset=100' '-Dspark.executor.memory=1G' 
'-Dspark.rdd.compress=True' '-Dspark.yarn.secondary.jars=' 
'-Dspark.submit.pyFiles=' 
'-Dspark.serializer=org.apache.spark.serializer.KryoSerializer' 
'-Dspark.driver.host=dstorm' '-Dspark.driver.appUIHistoryAddress=' 
'-Dspark.app.name=PySparkShell' 
'-Dspark.driver.appUIAddress=dstorm:4040' 
'-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G' 
'-Dspark.fileserver.uri=http://192.168.0.16:60305' 
'-Dspark.driver.port=44616' '-Dspark.master=yarn-client' 
org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused' --jar  
null  --arg  'dstorm:44616' --executor-memory 1024 --executor-cores 1 
--num-executors  2 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr

(Note: I happen to notice that 'spark.driver.memory' is missing as well).

===========================================

NEXT:

[ So let's try with enclosing quotes ]:
     user@linux$ pyspark --master yarn-client --driver-java-options 
'-Dspark.executor.memory=1G -Dspark.ui.port=8468 
-Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M 
-Dspark.executor.instances=3 
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar'
xxx --master yarn-client --driver-java-options 
"-Dspark.executor.memory=1G -Dspark.ui.port=8468 
-Dspark.driver.memory=512M -Dspark.yarn.executor.memoryOverhead=512M 
-Dspark.executor.instances=3 
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar"xxx

While this does pass all the options through (shown in the echo output 
above and in the command executed below), the pyspark invocation fails,
indicating that the application ended before I got to a shell prompt.
See the snippet below.

14/09/16 17:44:12 INFO yarn.Client:   command: $JAVA_HOME/bin/java 
-server -Xmx512m -Djava.io.tmpdir=$PWD/tmp 
'-Dspark.tachyonStore.folderName=spark-3b62ece7-a22a-4d0a-b773-1f5601e5eada' 
'-Dspark.executor.memory=1G' '-Dspark.driver.memory=512M' 
'-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar' 
'-Dspark.serializer.objectStreamReset=100' 
'-Dspark.executor.instances=3' '-Dspark.rdd.compress=True' 
'-Dspark.yarn.secondary.jars=' '-Dspark.submit.pyFiles=' 
'-Dspark.ui.port=8468' '-Dspark.driver.host=dstorm' 
'-Dspark.serializer=org.apache.spark.serializer.KryoSerializer' 
'-Dspark.driver.appUIHistoryAddress=' '-Dspark.app.name=PySparkShell' 
'-Dspark.driver.appUIAddress=dstorm:8468' 
'-Dspark.yarn.executor.memoryOverhead=512M' 
'-Dspark.driver.extraJavaOptions=-Dspark.executor.memory=1G 
-Dspark.ui.port=8468 -Dspark.driver.memory=512M 
-Dspark.yarn.executor.memoryOverhead=512M -Dspark.executor.instances=3 
-Dspark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar' 
'-Dspark.fileserver.uri=http://192.168.0.16:54171' 
'-Dspark.master=yarn-client' '-Dspark.driver.port=58542' 
org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused' --jar  
null  --arg  'dstorm:58542' --executor-memory 1024 --executor-cores 1 
--num-executors  3 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr


[ ... SNIP ... ]
14/09/16 17:44:12 INFO cluster.YarnClientSchedulerBackend: Application 
report from ASM:
      appMasterRpcPort: -1
      appStartTime: 1410903852044
      yarnAppState: ACCEPTED

14/09/16 17:44:13 INFO cluster.YarnClientSchedulerBackend: Application 
report from ASM:
      appMasterRpcPort: -1
      appStartTime: 1410903852044
      yarnAppState: ACCEPTED

14/09/16 17:44:14 INFO cluster.YarnClientSchedulerBackend: Application 
report from ASM:
      appMasterRpcPort: -1
      appStartTime: 1410903852044
      yarnAppState: ACCEPTED

14/09/16 17:44:15 INFO cluster.YarnClientSchedulerBackend: Application 
report from ASM:
      appMasterRpcPort: 0
      appStartTime: 1410903852044
      yarnAppState: RUNNING

14/09/16 17:44:19 ERROR cluster.YarnClientSchedulerBackend: Yarn 
application already ended: FAILED


Am I doing something wrong?

Thank you in advance!
Team didata





Re: Problem with pyspark command line invocation -- option truncation... (Spark v1.1.0) ...

Posted by Sandy Ryza <sa...@cloudera.com>.
Agreed with Andrew - the Spark properties file is usually a much nicer way
of specifying a bunch of properties.  To be clear, --driver-memory still
needs to go on the command line.

-Sandy

On Mon, Sep 22, 2014 at 1:49 AM, Andrew Or <an...@databricks.com> wrote:

> [ ... SNIP: quoted copy of Andrew Or's reply; it is archived in full below in this thread ... ]

Re: Problem with pyspark command line invocation -- option truncation... (Spark v1.1.0) ...

Posted by Andrew Or <an...@databricks.com>.
Hi Didata,

An alternative to what Sandy proposed is to set the Spark properties in a
special file `conf/spark-defaults.conf`. That way you don't have to specify
all the configs through the command line every time. The `--conf` option is
mostly intended for changing one or two parameters; it becomes cumbersome
to specify `--conf` many times, once for each config you have.
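
For example, the settings from the original invocation could live in
`conf/spark-defaults.conf` roughly like this (a sketch built from the values
in this thread; the format is one whitespace-separated key/value per line).
Note that spark.driver.memory is deliberately left out: as discussed
elsewhere in this thread, in client mode the driver memory still has to be
given on the command line as `--driver-memory`.

    spark.executor.memory                1G
    spark.ui.port                        8468
    spark.yarn.executor.memoryOverhead   512M
    spark.executor.instances             3
    spark.yarn.jar                       hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar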

In general, when a Spark setting refers to "java options", it applies to
non-Spark-property java options (e.g. -Xmx5g, or -XX:-UseParallelGC). The
recommended way of setting Spark-specific properties is documented here:
http://spark.apache.org/docs/latest/configuration.html

Also, as an aside (you may already know this): to pinpoint exactly what
went wrong with your executors in YARN, you can visit the ResourceManager's
web UI and click into your application. If the Hadoop JobHistoryServer is
set up properly, it will redirect you to the logs of the failed executors.

Let me know if you have more questions,
-Andrew



2014-09-21 10:38 GMT-07:00 Sandy Ryza <sa...@cloudera.com>:

> [ ... SNIP: quoted copy of Sandy Ryza's reply; it is archived in full below in this thread ... ]

Re: Problem with pyspark command line invocation -- option truncation... (Spark v1.1.0) ...

Posted by Sandy Ryza <sa...@cloudera.com>.
If using a client deploy mode, the driver memory can't go through --conf.
 spark-submit handles --driver-memory as a special case because it needs to
know how much memory to give the JVM before starting it and interpreting
the other properties.
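
In other words, the driver memory goes on the command line itself, e.g. (a
sketch reusing values from this thread):

    pyspark --master yarn-client --driver-memory 512M \
        --conf spark.yarn.executor.memoryOverhead=512M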

-Sandy

On Tue, Sep 16, 2014 at 10:20 PM, Dimension Data, LLC. <
subscriptions@didata.us> wrote:

> [ ... SNIP: quoted copy of the earlier messages; each is archived in full elsewhere in this thread ... ]

Re: Problem with pyspark command line invocation -- option truncation... (Spark v1.1.0) ...

Posted by "Dimension Data, LLC." <su...@didata.us>.
Hi Sandy:

Thank you. I have not tried that mechanism (I wasn't aware of it). I will 
try that instead.

Is it possible to also represent '--driver-memory' and 
'--executor-memory' (and basically all properties)
using the '--conf' directive?

The Reason: I actually discovered this issue while writing a custom 
PYTHONSTARTUP script that I use
to launch *bpython* or *python* or my *WING python IDE* with. That 
script reads a python *dict* (from a file)
containing key/value pairs from which it constructs the 
"--driver-java-options ..." string, which I will now
switch to generating as '--conf key1=val1 --conf key2=val2 --conf 
key3=val3' (and so on) instead, as sketched below.
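
A minimal sketch of that conversion (hypothetical file name and dict 
contents; not the actual script):

    # Read a dict of Spark properties and emit one --conf flag per entry.
    import ast

    with open('spark_conf.dict') as f:       # hypothetical properties file
        conf = ast.literal_eval(f.read())    # e.g. {'spark.ui.port': '8468', 'spark.executor.instances': '3'}

    flags = ' '.join('--conf %s=%s' % (k, v) for k, v in sorted(conf.items()))
    print(flags)   # --conf spark.executor.instances=3 --conf spark.ui.port=8468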

If all of the properties can be represented this way, it makes 
the code cleaner (everything in
the dict file, and no one-offs).

Either way, thank you. =:)

Noel,
team didata


On 09/16/2014 08:03 PM, Sandy Ryza wrote:
> [ ... SNIP: quoted copy of Sandy Ryza's reply and the original message; both are archived in full elsewhere in this thread ... ]


Re: Problem with pyspark command line invocation -- option truncation... (Spark v1.1.0) ...

Posted by Sandy Ryza <sa...@cloudera.com>.
Hi team didata,

This doesn't directly answer your question, but with Spark 1.1, instead of
using the driver options, it's better to pass your Spark properties using
the "--conf" option.

E.g.
pyspark --master yarn-client --conf spark.shuffle.spill=true --conf
spark.yarn.executor.memoryOverhead=512M

Additionally, driver and executor memory have dedicated options:

pyspark --master yarn-client --conf spark.shuffle.spill=true --conf
spark.yarn.executor.memoryOverhead=512M --driver-memory 3G
--executor-memory 5G
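
Applied to your original invocation, the equivalent would look something
like this (a sketch, untested):

    pyspark --master yarn-client \
        --driver-memory 512M --executor-memory 1G \
        --conf spark.ui.port=8468 \
        --conf spark.yarn.executor.memoryOverhead=512M \
        --conf spark.executor.instances=3 \
        --conf spark.yarn.jar=hdfs://namenode:8020/user/spark/share/lib/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.2.jar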

-Sandy


On Tue, Sep 16, 2014 at 6:22 PM, Dimension Data, LLC. <
subscriptions@didata.us> wrote:

> [ ... SNIP: quoted copy of the original message; it is archived in full at the top of this thread ... ]