Posted to user@spark.apache.org by Greg Hill <gr...@RACKSPACE.COM> on 2014/09/08 15:59:59 UTC

clarification for some spark on yarn configuration options

Is SPARK_EXECUTOR_INSTANCES the total number of workers in the cluster or the workers per slave node?

Is spark.executor.instances an actual config option?  I found that in a commit, but it's not in the docs.

What is the difference between spark.yarn.executor.memoryOverhead and spark.executor.memory?  Same question for the 'driver' variant, but I assume it's the same answer.

Is there a spark.driver.memory option that's undocumented or do you have to use the environment variable SPARK_DRIVER_MEMORY?

What config option or environment variable do I need to set to get pyspark interactive to pick up the yarn class path?  The ones that work for spark-shell and spark-submit don't seem to work for pyspark.

Thanks in advance.

Greg
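
For reference, the options asked about here can all be set in spark-defaults.conf. The values below are illustrative placeholders only, not recommendations; the comments summarize the answers given later in the thread:

```properties
# spark-defaults.conf -- illustrative values only
# total number of executors across the cluster (YARN mode)
spark.executor.instances             4
# JVM heap for each executor
spark.executor.memory                2g
# extra MB YARN adds to each executor container on top of the heap
spark.yarn.executor.memoryOverhead   384
# honored directly in Spark 1.1+; on 1.0.x pass --driver-memory instead
spark.driver.memory                  1g
```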

Re: clarification for some spark on yarn configuration options

Posted by Andrew Or <an...@databricks.com>.
Yes... good find. I have filed a JIRA here:
https://issues.apache.org/jira/browse/SPARK-3661 and will get to fixing it
shortly. Both of these fixes will be available in 1.1.1. Until both of
these are merged in, it appears that the only way you can do it now is
through --driver-memory.

-Andrew

2014-09-23 7:23 GMT-07:00 Greg Hill <gr...@rackspace.com>:

>  Thanks for looking into it.  I'm trying to avoid making the user pass in
> any parameters by configuring it to use the right values for the cluster
> size by default, hence my reliance on the configuration.  I'd rather just
> use spark-defaults.conf than the environment variables, and looking at the
> code you modified, I don't see any place it's picking up
> spark.driver.memory either.  Is that a separate bug?
>
>  Greg
>
>
>   From: Andrew Or <an...@databricks.com>
> Date: Monday, September 22, 2014 8:11 PM
> To: Nishkam Ravi <nr...@cloudera.com>
> Cc: Greg <gr...@rackspace.com>, "user@spark.apache.org" <
> user@spark.apache.org>
>
> Subject: Re: clarification for some spark on yarn configuration options
>
>   Hi Greg,
>
>  From browsing the code quickly I believe SPARK_DRIVER_MEMORY is not
> actually picked up in cluster mode. This is a bug and I have opened a PR to
> fix it: https://github.com/apache/spark/pull/2500.
> For now, please use --driver-memory instead, which should work for both
> client and cluster mode.
>
>  Thanks for pointing this out,
> -Andrew
>
> 2014-09-22 14:04 GMT-07:00 Nishkam Ravi <nr...@cloudera.com>:
>
>> Maybe try --driver-memory if you are using spark-submit?
>>
>>  Thanks,
>> Nishkam
>>
>> On Mon, Sep 22, 2014 at 1:41 PM, Greg Hill <gr...@rackspace.com>
>> wrote:
>>
>>>  Ah, I see.  It turns out that my problem is that that comparison is
>>> ignoring SPARK_DRIVER_MEMORY and comparing to the default of 512m.  Is that
>>> a bug that's since fixed?  I'm on 1.0.1 and using 'yarn-cluster' as the
>>> master.  'yarn-client' seems to pick up the values and works fine.
>>>
>>>  Greg
>>>
>>>   From: Nishkam Ravi <nr...@cloudera.com>
>>> Date: Monday, September 22, 2014 3:30 PM
>>> To: Greg <gr...@rackspace.com>
>>> Cc: Andrew Or <an...@databricks.com>, "user@spark.apache.org" <
>>> user@spark.apache.org>
>>>
>>> Subject: Re: clarification for some spark on yarn configuration options
>>>
>>>   Greg, if you look carefully, the code is enforcing that the
>>> memoryOverhead be lower (and not higher) than spark.driver.memory.
>>>
>>>  Thanks,
>>> Nishkam
>>>
>>> On Mon, Sep 22, 2014 at 1:26 PM, Greg Hill <gr...@rackspace.com>
>>> wrote:
>>>
>>>>  I thought I had this all figured out, but I'm getting some weird
>>>> errors now that I'm attempting to deploy this on production-size servers.
>>>> It's complaining that I'm not allocating enough memory to the
>>>> memoryOverhead values.  I tracked it down to this code:
>>>>
>>>>
>>>> https://github.com/apache/spark/blob/ed1980ffa9ccb87d76694ba910ef22df034bca49/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala#L70
>>>>
>>>>  Unless I'm reading it wrong, those checks are enforcing that you set
>>>> spark.yarn.driver.memoryOverhead to be higher than spark.driver.memory, but
>>>> that makes no sense to me since that memory is just supposed to be what
>>>> YARN needs on top of what you're allocating for Spark.  My understanding
>>>> was that the overhead values should be quite a bit lower (and by default
>>>> they are).
>>>>
>>>>  Also, why must the executor be allocated less memory than the
>>>> driver's memory overhead value?
>>>>
>>>>  What am I misunderstanding here?
>>>>
>>>>  Greg
>>>>
>>>>   From: Andrew Or <an...@databricks.com>
>>>> Date: Tuesday, September 9, 2014 5:49 PM
>>>> To: Greg <gr...@rackspace.com>
>>>> Cc: "user@spark.apache.org" <us...@spark.apache.org>
>>>> Subject: Re: clarification for some spark on yarn configuration options
>>>>
>>>>   Hi Greg,
>>>>
>>>>  SPARK_EXECUTOR_INSTANCES is the total number of workers in the
>>>> cluster. The equivalent "spark.executor.instances" is just another way to
>>>> set the same thing in your spark-defaults.conf. Maybe this should be
>>>> documented. :)
>>>>
>>>>  "spark.yarn.executor.memoryOverhead" is just an additional margin
>>>> added to "spark.executor.memory" for the container. In addition to the
>>>> executor's memory, the container in which the executor is launched needs
>>>> some extra memory for system processes, and this is what this "overhead"
>>>> (somewhat of a misnomer) is for. The same goes for the driver equivalent.
>>>>
>>>>  "spark.driver.memory" behaves differently depending on which version
>>>> of Spark you are using. If you are using Spark 1.1+ (this was released very
>>>> recently), you can directly set "spark.driver.memory" and this will take
>>>> effect. Otherwise, setting this doesn't actually do anything for client
>>>> deploy mode, and you have two alternatives: (1) set the environment
>>>> variable equivalent SPARK_DRIVER_MEMORY in spark-env.sh, and (2) if you are
>>>> using Spark submit (or bin/spark-shell, or bin/pyspark, which go through
>>>> bin/spark-submit), pass the "--driver-memory" command line argument.
>>>>
>>>>  If you want your PySpark application (driver) to pick up extra class
>>>> path, you can pass the "--driver-class-path" to Spark submit. If you are
>>>> using Spark 1.1+, you may set "spark.driver.extraClassPath" in your
>>>> spark-defaults.conf. There is also an environment variable you could set
>>>> (SPARK_CLASSPATH), though this is now deprecated.
>>>>
>>>>  Let me know if you have more questions about these options,
>>>> -Andrew
>>>>
>>>>
>>>> 2014-09-08 6:59 GMT-07:00 Greg Hill <gr...@rackspace.com>:
>>>>
>>>>>  Is SPARK_EXECUTOR_INSTANCES the total number of workers in the
>>>>> cluster or the workers per slave node?
>>>>>
>>>>>  Is spark.executor.instances an actual config option?  I found that
>>>>> in a commit, but it's not in the docs.
>>>>>
>>>>>  What is the difference between spark.yarn.executor.memoryOverhead
>>>>> and spark.executor.memory ?  Same question for the 'driver' variant,
>>>>> but I assume it's the same answer.
>>>>>
>>>>>  Is there a spark.driver.memory option that's undocumented or do you
>>>>> have to use the environment variable SPARK_DRIVER_MEMORY?
>>>>>
>>>>>  What config option or environment variable do I need to set to get
>>>>> pyspark interactive to pick up the yarn class path?  The ones that work for
>>>>> spark-shell and spark-submit don't seem to work for pyspark.
>>>>>
>>>>>  Thanks in advance.
>>>>>
>>>>>  Greg
>>>>>
>>>>
>>>>
>>>
>>
>
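
Andrew's description of the overhead (quoted above) implies simple container arithmetic; a minimal sketch, where the flat 384 MB default is an assumption about early Spark 1.x:

```python
# Sketch of the YARN container sizing described in the thread: the
# container must hold the executor's JVM heap (spark.executor.memory)
# plus a comparatively small amount for non-heap/system use
# (spark.yarn.executor.memoryOverhead). 384 MB is assumed as the 1.x default.
DEFAULT_OVERHEAD_MB = 384

def container_size_mb(heap_mb, overhead_mb=DEFAULT_OVERHEAD_MB):
    """Total memory YARN allocates for one executor container."""
    return heap_mb + overhead_mb

# The overhead is normally much smaller than the heap, which is why a
# check requiring overhead > heap (as Greg first read the code) would be
# surprising; the check actually enforces the opposite.
print(container_size_mb(8192))   # 8576 MB container for an 8 GB heap
```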

Re: clarification for some spark on yarn configuration options

Posted by Greg Hill <gr...@RACKSPACE.COM>.
Gah, ignore me again.  I was reading the logic backwards.  For some reason it isn't picking up my SPARK_DRIVER_MEMORY environment variable and is using the default of 512m.  Probably an environmental issue.

Greg

From: Greg <gr...@rackspace.com>
Date: Monday, September 22, 2014 3:26 PM
To: Andrew Or <an...@databricks.com>
Cc: "user@spark.apache.org" <us...@spark.apache.org>
Subject: Re: clarification for some spark on yarn configuration options

I thought I had this all figured out, but I'm getting some weird errors now that I'm attempting to deploy this on production-size servers.  It's complaining that I'm not allocating enough memory to the memoryOverhead values.  I tracked it down to this code:

https://github.com/apache/spark/blob/ed1980ffa9ccb87d76694ba910ef22df034bca49/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala#L70

Unless I'm reading it wrong, those checks are enforcing that you set spark.yarn.driver.memoryOverhead to be higher than spark.driver.memory, but that makes no sense to me since that memory is just supposed to be what YARN needs on top of what you're allocating for Spark.  My understanding was that the overhead values should be quite a bit lower (and by default they are).

Also, why must the executor be allocated less memory than the driver's memory overhead value?

What am I misunderstanding here?

Greg

From: Andrew Or <an...@databricks.com>
Date: Tuesday, September 9, 2014 5:49 PM
To: Greg <gr...@rackspace.com>
Cc: "user@spark.apache.org" <us...@spark.apache.org>
Subject: Re: clarification for some spark on yarn configuration options

Hi Greg,

SPARK_EXECUTOR_INSTANCES is the total number of workers in the cluster. The equivalent "spark.executor.instances" is just another way to set the same thing in your spark-defaults.conf. Maybe this should be documented. :)

"spark.yarn.executor.memoryOverhead" is just an additional margin added to "spark.executor.memory" for the container. In addition to the executor's memory, the container in which the executor is launched needs some extra memory for system processes, and this is what this "overhead" (somewhat of a misnomer) is for. The same goes for the driver equivalent.

"spark.driver.memory" behaves differently depending on which version of Spark you are using. If you are using Spark 1.1+ (this was released very recently), you can directly set "spark.driver.memory" and this will take effect. Otherwise, setting this doesn't actually do anything for client deploy mode, and you have two alternatives: (1) set the environment variable equivalent SPARK_DRIVER_MEMORY in spark-env.sh, and (2) if you are using Spark submit (or bin/spark-shell, or bin/pyspark, which go through bin/spark-submit), pass the "--driver-memory" command line argument.

If you want your PySpark application (driver) to pick up extra class path, you can pass the "--driver-class-path" to Spark submit. If you are using Spark 1.1+, you may set "spark.driver.extraClassPath" in your spark-defaults.conf. There is also an environment variable you could set (SPARK_CLASSPATH), though this is now deprecated.

Let me know if you have more questions about these options,
-Andrew


2014-09-08 6:59 GMT-07:00 Greg Hill <gr...@rackspace.com>:
Is SPARK_EXECUTOR_INSTANCES the total number of workers in the cluster or the workers per slave node?

Is spark.executor.instances an actual config option?  I found that in a commit, but it's not in the docs.

What is the difference between spark.yarn.executor.memoryOverhead and spark.executor.memory ?  Same question for the 'driver' variant, but I assume it's the same answer.

Is there a spark.driver.memory option that's undocumented or do you have to use the environment variable SPARK_DRIVER_MEMORY?

What config option or environment variable do I need to set to get pyspark interactive to pick up the yarn class path?  The ones that work for spark-shell and spark-submit don't seem to work for pyspark.

Thanks in advance.

Greg


Re: clarification for some spark on yarn configuration options

Posted by Greg Hill <gr...@RACKSPACE.COM>.
I thought I had this all figured out, but I'm getting some weird errors now that I'm attempting to deploy this on production-size servers.  It's complaining that I'm not allocating enough memory to the memoryOverhead values.  I tracked it down to this code:

https://github.com/apache/spark/blob/ed1980ffa9ccb87d76694ba910ef22df034bca49/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala#L70

Unless I'm reading it wrong, those checks are enforcing that you set spark.yarn.driver.memoryOverhead to be higher than spark.driver.memory, but that makes no sense to me since that memory is just supposed to be what YARN needs on top of what you're allocating for Spark.  My understanding was that the overhead values should be quite a bit lower (and by default they are).

Also, why must the executor be allocated less memory than the driver's memory overhead value?

What am I misunderstanding here?

Greg

From: Andrew Or <an...@databricks.com>
Date: Tuesday, September 9, 2014 5:49 PM
To: Greg <gr...@rackspace.com>
Cc: "user@spark.apache.org" <us...@spark.apache.org>
Subject: Re: clarification for some spark on yarn configuration options

Hi Greg,

SPARK_EXECUTOR_INSTANCES is the total number of workers in the cluster. The equivalent "spark.executor.instances" is just another way to set the same thing in your spark-defaults.conf. Maybe this should be documented. :)

"spark.yarn.executor.memoryOverhead" is just an additional margin added to "spark.executor.memory" for the container. In addition to the executor's memory, the container in which the executor is launched needs some extra memory for system processes, and this is what this "overhead" (somewhat of a misnomer) is for. The same goes for the driver equivalent.

"spark.driver.memory" behaves differently depending on which version of Spark you are using. If you are using Spark 1.1+ (this was released very recently), you can directly set "spark.driver.memory" and this will take effect. Otherwise, setting this doesn't actually do anything for client deploy mode, and you have two alternatives: (1) set the environment variable equivalent SPARK_DRIVER_MEMORY in spark-env.sh, and (2) if you are using Spark submit (or bin/spark-shell, or bin/pyspark, which go through bin/spark-submit), pass the "--driver-memory" command line argument.

If you want your PySpark application (driver) to pick up extra class path, you can pass the "--driver-class-path" to Spark submit. If you are using Spark 1.1+, you may set "spark.driver.extraClassPath" in your spark-defaults.conf. There is also an environment variable you could set (SPARK_CLASSPATH), though this is now deprecated.

Let me know if you have more questions about these options,
-Andrew


2014-09-08 6:59 GMT-07:00 Greg Hill <gr...@rackspace.com>:
Is SPARK_EXECUTOR_INSTANCES the total number of workers in the cluster or the workers per slave node?

Is spark.executor.instances an actual config option?  I found that in a commit, but it's not in the docs.

What is the difference between spark.yarn.executor.memoryOverhead and spark.executor.memory ?  Same question for the 'driver' variant, but I assume it's the same answer.

Is there a spark.driver.memory option that's undocumented or do you have to use the environment variable SPARK_DRIVER_MEMORY?

What config option or environment variable do I need to set to get pyspark interactive to pick up the yarn class path?  The ones that work for spark-shell and spark-submit don't seem to work for pyspark.

Thanks in advance.

Greg


Re: clarification for some spark on yarn configuration options

Posted by Andrew Or <an...@databricks.com>.
Hi Greg,

SPARK_EXECUTOR_INSTANCES is the total number of workers in the cluster. The
equivalent "spark.executor.instances" is just another way to set the same
thing in your spark-defaults.conf. Maybe this should be documented. :)
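A quick sketch of the two equivalent ways to set this (the count of 10 is just a made-up example, not a value from this thread):

```shell
# In spark-env.sh:
export SPARK_EXECUTOR_INSTANCES=10

# Or the equivalent line in conf/spark-defaults.conf:
#   spark.executor.instances   10
```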

"spark.yarn.executor.memoryOverhead" is just an additional margin added to
"spark.executor.memory" for the container. In addition to the executor's
memory, the container in which the executor is launched needs some extra
memory for system processes, and this is what this "overhead" (somewhat of
a misnomer) is for. The same goes for the driver equivalent.
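To make the sizing concrete, here is a rough sketch of how the two settings combine (the 4096m figure is an invented example; 384 MB was the memoryOverhead default at the time):

```shell
# Illustrative spark-defaults.conf values:
#   spark.executor.memory              4096m
#   spark.yarn.executor.memoryOverhead 384
#
# YARN is asked for roughly the sum per executor container,
# not spark.executor.memory alone:
CONTAINER_MB=$((4096 + 384))
echo "$CONTAINER_MB"   # 4480
```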

"spark.driver.memory" behaves differently depending on which version of
Spark you are using. If you are using Spark 1.1+ (this was released very
recently), you can directly set "spark.driver.memory" and this will take
effect. Otherwise, setting this doesn't actually do anything for client
deploy mode, and you have two alternatives: (1) set the environment
variable equivalent SPARK_DRIVER_MEMORY in spark-env.sh, and (2) if you are
using Spark submit (or bin/spark-shell, or bin/pyspark, which go through
bin/spark-submit), pass the "--driver-memory" command line argument.
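Putting those alternatives side by side (the 2g size and app name are placeholders, not values from this thread):

```shell
# (1) Works on any version: pass the flag through spark-submit
#   spark-submit --master yarn-client --driver-memory 2g app.py

# (2) Pre-1.1 client mode: the environment variable in spark-env.sh
export SPARK_DRIVER_MEMORY=2g

# (3) Honored only on Spark 1.1+: conf/spark-defaults.conf
#   spark.driver.memory   2g
```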

If you want your PySpark application (driver) to pick up extra class path,
you can pass the "--driver-class-path" to Spark submit. If you are using
Spark 1.1+, you may set "spark.driver.extraClassPath" in your
spark-defaults.conf. There is also an environment variable you could set
(SPARK_CLASSPATH), though this is now deprecated.
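For reference, a sketch of those three routes (the jar path is a placeholder):

```shell
# (1) Command-line flag; on 1.1+ pyspark goes through spark-submit, so:
#   pyspark --driver-class-path /opt/jars/extra.jar

# (2) conf/spark-defaults.conf on Spark 1.1+:
#   spark.driver.extraClassPath  /opt/jars/extra.jar

# (3) Deprecated environment variable, still honored at the time:
export SPARK_CLASSPATH=/opt/jars/extra.jar
```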

Let me know if you have more questions about these options,
-Andrew


2014-09-08 6:59 GMT-07:00 Greg Hill <gr...@rackspace.com>:

>  Is SPARK_EXECUTOR_INSTANCES the total number of workers in the cluster
> or the workers per slave node?
>
>  Is spark.executor.instances an actual config option?  I found that in a
> commit, but it's not in the docs.
>
>  What is the difference between spark.yarn.executor.memoryOverhead and spark.executor.memory
> ?  Same question for the 'driver' variant, but I assume it's the same
> answer.
>
>  Is there a spark.driver.memory option that's undocumented or do you have
> to use the environment variable SPARK_DRIVER_MEMORY?
>
>  What config option or environment variable do I need to set to get
> pyspark interactive to pick up the yarn class path?  The ones that work for
> spark-shell and spark-submit don't seem to work for pyspark.
>
>  Thanks in advance.
>
>  Greg
>