Posted to dev@spark.apache.org by Dale Richardson <da...@hotmail.com> on 2015/03/13 11:07:57 UTC

Spark config option 'expression language' feedback request

PR#4937 (https://github.com/apache/spark/pull/4937) adds a feature that allows Spark configuration options (whether given on the command line, in an environment variable, or in a configuration file) to be specified via a simple expression language.


Such a feature has the following end-user benefits:
- Allows time intervals or byte quantities to be specified in appropriate, easy-to-follow units, e.g. 1 week rather than 604800 seconds

- Allows a configuration option to be scaled in relation to system attributes, e.g.

SPARK_WORKER_CORES = numCores - 1

SPARK_WORKER_MEMORY = physicalMemoryBytes - 1.5 GB

- Gives the ability to scale multiple configuration options together, e.g.:

spark.driver.memory = 0.75 * physicalMemoryBytes

spark.driver.maxResultSize = spark.driver.memory * 0.8


The following functions are currently supported by this PR:
NumCores:             Number of cores assigned to the JVM (usually == physical machine cores)
PhysicalMemoryBytes:  Physical memory size of the hosting machine
JVMTotalMemoryBytes:  Bytes of memory currently allocated to the JVM
JVMMaxMemoryBytes:    Maximum number of bytes of memory available to the JVM
JVMFreeMemoryBytes:   JVMMaxMemoryBytes - JVMTotalMemoryBytes
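
For reference, attributes like these map onto standard JVM calls. A rough
sketch of how such values could be obtained (illustrative only, not the PR's
implementation; the object and method names are assumptions) is:

import java.lang.management.ManagementFactory

// Illustrative sketch only, not the PR's implementation: the standard JVM
// calls that attributes like the above would typically map onto.
object SystemAttributes {
  private val rt = Runtime.getRuntime

  def numCores: Int             = rt.availableProcessors  // usually == physical cores
  def jvmTotalMemoryBytes: Long = rt.totalMemory           // currently allocated to the JVM
  def jvmMaxMemoryBytes: Long   = rt.maxMemory              // the -Xmx ceiling
  def jvmFreeMemoryBytes: Long  = jvmMaxMemoryBytes - jvmTotalMemoryBytes

  // java.lang.Runtime does not expose physical memory; one common
  // (HotSpot-specific) route is the com.sun.management OperatingSystemMXBean.
  def physicalMemoryBytes: Long =
    ManagementFactory.getOperatingSystemMXBean
      .asInstanceOf[com.sun.management.OperatingSystemMXBean]
      .getTotalPhysicalMemorySize
}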


I was wondering if anybody on the mailing list has any further ideas on other functions that would be useful to have available when specifying Spark configuration options?
Regards, Dale.

Re: Spark config option 'expression language' feedback request

Posted by Mike Hynes <91...@gmail.com>.
Hi,
This is just a thought from my experience setting up Spark to run on a
Linux cluster. I found it a bit unusual that some parameters could be
specified as command-line args to spark-submit, others as env variables,
and some in a configuration file. What I ended up doing was writing my own
bash script that exported all the variables, plus other scripts that call
spark-submit with the arguments I wanted.

I think that the "expression language" idea would be doable with an
entirely env-variable-based approach, or as command-line parameters. That
way there is only one configuration mechanism, which is easily scriptable,
and you are still able to express relations like:
spark.driver.maxResultSize = spark.driver.memory * 0.8
in your config as
export SPARK_DRIVER_MAXRESULTSIZE=$(bc -l <<< "0.8 * $SPARK_DRIVER_MEMORY")

It may not look as nice, but it does allow for everything to be in one
place, and for separate config files for certain jobs. Admittedly, if you
want something like 0.8 * 2G, you first have to write a bash function to
expand all the "G M k" suffixes, but that's not too painful.
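
For instance, a minimal helper along those lines might look like the
following (an illustrative sketch only; the function name, variable name,
and the integer-only handling are assumptions):

# Expand an integer size with a k/M/G suffix into bytes so it can be fed to bc.
expand_size() {
  local v=$1
  case "$v" in
    *[kK]) echo $(( ${v%[kK]} * 1024 )) ;;
    *[mM]) echo $(( ${v%[mM]} * 1024 * 1024 )) ;;
    *[gG]) echo $(( ${v%[gG]} * 1024 * 1024 * 1024 )) ;;
    *)     echo "$v" ;;
  esac
}

# e.g. 0.8 * 2G:
export SPARK_DRIVER_MAXRESULTSIZE=$(bc -l <<< "0.8 * $(expand_size 2G)")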

Re: Spark config option 'expression language' feedback request

Posted by Reynold Xin <rx...@databricks.com>.
Reviving this to see if others would like to chime in about this
"expression language" for config options.



RE: Spark config option 'expression language' feedback request

Posted by Dale Richardson <da...@hotmail.com>.
Mridul, I may have added some confusion by giving examples from completely different areas. For example, the number of cores available for tasking on each worker machine is a resource-controller-level configuration variable. In standalone mode (i.e. using Spark's home-grown resource manager) the configuration variable SPARK_WORKER_CORES is an item that Spark admins can set (and that we can use expressions for). The equivalent variable for YARN (yarn.nodemanager.resource.cpu-vcores) is only used by YARN's node manager setup, is set by YARN administrators, and is outside the control of Spark (and most users). If you are not a cluster administrator then both variables are irrelevant to you. The same goes for SPARK_WORKER_MEMORY.

As for spark.executor.memory: as there is no way to know the attributes of a machine before a task is allocated to it, we cannot use any of the JVMInfo functions. For options like that, the expression parser can easily be limited to supporting only byte units of scale (KB/MB/GB etc.) and references to other configuration variables.
Regards, Dale.





Re: Spark config option 'expression language' feedback request

Posted by Mridul Muralidharan <mr...@gmail.com>.
Let me try to rephrase my query.
How can a user specify, for example, what the executor memory or the
number of cores should be?

I don't want a situation where some variables can be specified using
one set of idioms (from this PR, for example) and another set cannot
be.


Regards,
Mridul






RE: Spark config option 'expression language' feedback request

Posted by Dale Richardson <da...@hotmail.com>.


Thanks for your questions, Mridul.
I assume you are referring to how the functionality to query system state works in YARN and Mesos?
The APIs used are the standard JVM APIs, so the functionality will work without change. There is no real use case for using 'physicalMemoryBytes' in these cases though, as the JVM size has already been limited by the resource manager.
Regards, Dale.

Re: Spark config option 'expression language' feedback request

Posted by Mridul Muralidharan <mr...@gmail.com>.
I am curious how you are going to support these over Mesos and YARN.
Any configuration change like this should be applicable to all of them, not
just local and standalone modes.

Regards
Mridul


Re: Spark config option 'expression language' feedback request

Posted by Imran Rashid <ir...@cloudera.com>.
IMO, Spark's config is kind of a mess right now.  I completely agree with
Reynold that Spark's handling of config ought to be super-simple; it's not
the kind of thing we want to put much effort into in Spark itself.  It
sounds so trivial that everyone wants to redo it, but then all these
additional features start to get thrown in and it starts to get
complicated.  This is one of many reasons our config handling is
inadequate.  It would be better if we could outsource it to other
libraries, or better yet, let users bring their own.

The biggest problem, in my mind, is that there isn't a definitive,
strongly-typed, modular listing of all the parameters.  This makes it
really hard to put your own thing on top -- you've got to manually go
through all the options and put them into your own config library, and
then make sure it's up to date with every new release of Spark.

Just as a small example of how the options are hard to track down, some of
the options for event logging are listed in SparkContext:
https://github.com/apache/spark/blob/424e987dfebbbaa37f4496d44090d469a931ce76/core/src/main/scala/org/apache/spark/SparkContext.scala#L229

and some others are listed in EventLoggingListener:
https://github.com/apache/spark/blob/424e987dfebbbaa37f4496d44090d469a931ce76/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L60

This also makes it a headache when developing and trying to keep the
documentation up to date.

There are a handful of different libraries that might help out with this:
scopt, argot, scallop, sumac.  I'm biased towards sumac (since I wrote it),
but probably any of these would let me do whatever customizations I wanted
on top, without needing to manually keep every option in sync.  That said,
I do think sumac is especially well suited to the way Spark uses
configuration -- the nested structure directly maps to the way we have
things organized currently.  So, e.g., everything related to event logging
would get placed in a class like:

class EventLoggingOpts {
  var enabled = false
  var compress = false
  var testing = false
  var overwrite = false
  var buffer: Bytes = 100.kilobytes
}


Another plus is that you get fail-fast behavior -- if you put in some
unparseable value, the job will fail immediately, rather than 1 hour in
when you first try to access the value.
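
As a generic illustration of that fail-fast point (a sketch only, not
Sumac's actual API): eagerly converting every raw value into a typed field
when the config object is built surfaces a bad setting immediately.

// Generic sketch, not Sumac's API: all values are parsed when the config
// object is constructed, so a malformed value fails at startup.
case class EventLogSettings(raw: Map[String, String]) {
  val enabled: Boolean = raw.getOrElse("spark.eventLog.enabled", "false").toBoolean
  val bufferKb: Long   = raw.getOrElse("spark.eventLog.buffer.kb", "100").toLong
}

// EventLogSettings(Map("spark.eventLog.buffer.kb" -> "1OO"))
//   throws NumberFormatException here, not an hour into the job.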

In any case, my main point is just that I think we should try to make our
config more compatible with external config tools, rather than trying to
build our own.  And after that, I'd just like to throw Sumac into the ring
as a contender :)



RE: Spark config option 'expression language' feedback request

Posted by Dale Richardson <da...@hotmail.com>.


Hi Reynold, those are some very good questions.

Re: Known libraries
There are a number of well-known libraries that we could use to implement this feature, including MVEL, OGNL and JBoss EL, or even Spring's EL. I looked at using them to prototype this feature in the beginning, but they all ended up bringing in a lot of code to service a pretty small functional requirement. The prime requirements I was trying to meet were:
1. Be able to specify quantities in KB, MB, GB etc. transparently.
2. Be able to specify some options as fractions of system attributes, e.g. cpuCores * 0.8.
By implementing just this functionality and nothing else, I figured I was constraining things enough that end users get useful functionality, but not enough functionality to shoot themselves in the foot in new and interesting ways. I couldn't see a nice way of limiting the expressiveness of third-party libraries to this extent.
I'd be happy to re-examine the feasibility of pulling in one of the third-party libraries if you think that approach has more merit, but I do caution that we may be opening a Pandora's box of potential functionality. Those third-party libraries have a lot of (potentially excess) functionality in them.

Re: Code complexity
I wrote the bare minimum code I could come up with to service the above-mentioned functionality, and then refactored it to use a stacked-traits pattern, which increased the code size by a further 30% or so. The expression code as it stands is pretty minimal, and has more than 120 unit tests proving its functionality. More than half the code is taken up by utility classes that allow easy reference to byte quantities and time units. The design was deliberately limited to meeting the above requirements and not much more, to reduce the chance of other subtleties raising their heads.

Re: Workarounds
It would be pretty simple to implement fallback functionality to disable expression parsing by:
1. Globally: a configuration option to disable all expression parsing and fall back to simple Java property parsing.
2. Locally: a known prefix that disables expression parsing for that option.
This should give enough workarounds to keep things running in the unlikely event that something crops up.

Re: Error messages
Regarding your comment about nice error messages, I would have to agree with you; they would have been nice. In the end I just return an Option[Double] to the calling code if the entire string is parsed correctly. Given the additional complexity that adding error messages involved, I retrospectively justify this by asking how much info you need to debug an expression like 'cpuCores * 0.8'. :)
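
A minimal sketch of what such an Option[Double]-returning parser could look
like with Scala's parser combinators is below (illustrative only, not the
PR's actual code; the attribute names and unit set are assumptions):

import scala.util.parsing.combinator.JavaTokenParsers

// Illustrative sketch only: a tiny parser for expressions such as
// "0.8 * numCores" or "2 gb - 512 mb", returning Some(value) only when the
// whole string parses.
object ConfigExpr extends JavaTokenParsers {
  private val attributes: Map[String, Double] = Map(
    "numcores"          -> Runtime.getRuntime.availableProcessors.toDouble,
    "jvmmaxmemorybytes" -> Runtime.getRuntime.maxMemory.toDouble)

  private val units: Map[String, Double] = Map(
    "kb" -> 1024d, "mb" -> 1024d * 1024, "gb" -> 1024d * 1024 * 1024)

  private def expr: Parser[Double] = term ~ rep(("+" | "-") ~ term) ^^ {
    case first ~ rest => rest.foldLeft(first) {
      case (acc, "+" ~ v) => acc + v
      case (acc, _ ~ v)   => acc - v
    }
  }
  private def term: Parser[Double] = factor ~ rep(("*" | "/") ~ factor) ^^ {
    case first ~ rest => rest.foldLeft(first) {
      case (acc, "*" ~ v) => acc * v
      case (acc, _ ~ v)   => acc / v
    }
  }
  private def factor: Parser[Double] = quantity | attribute | ("(" ~> expr <~ ")")

  // A number with an optional byte-unit suffix, e.g. "1.5 gb".
  private def quantity: Parser[Double] = floatingPointNumber ~ opt(ident) ^? ({
    case n ~ None                                     => n.toDouble
    case n ~ Some(u) if units.contains(u.toLowerCase) => n.toDouble * units(u.toLowerCase)
  }, q => s"unknown unit in: $q")

  // A named system attribute, e.g. "numCores".
  private def attribute: Parser[Double] = ident ^? ({
    case a if attributes.contains(a.toLowerCase) => attributes(a.toLowerCase)
  }, a => s"unknown attribute: $a")

  // None if any part of the string fails to parse, as described above.
  def evaluate(s: String): Option[Double] = parseAll(expr, s) match {
    case Success(v, _) => Some(v)
    case _             => None
  }
}

// e.g. ConfigExpr.evaluate("0.8 * numCores")        -> Some(...)
//      ConfigExpr.evaluate("2 gb - 512 mb")         -> Some(1.610612736E9)
//      ConfigExpr.evaluate("0.8 * physicalMemory")  -> None (unknown attribute)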
Thanks for the feedback.
Regards, Dale.

Re: Spark config option 'expression language' feedback request

Posted by Reynold Xin <rx...@databricks.com>.
This is an interesting idea.

Are there well-known libraries for doing this? Config is the one place
where it would be great to have something ridiculously simple, so that it
is more or less bug-free. I'm concerned about the complexity in this patch
and the subtle bugs that it might introduce to config options for which
users will have no workarounds. Also, I believe it is fairly hard for nice
error messages to propagate when using Scala's parser combinators.

